This is a completely rewritten parser for tsearch2 with full UTF8 support. The parser uses a finite-state automaton technique and is expected to be flexible and compatible with the old tsearch2 parser (with some of its errors fixed).
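To make the finite-state idea concrete, here is a minimal sketch in Python. It is not the actual parser, which has many more states and token types. The state names, the transition table and the function names are hypothetical; only the token ids 1 (word) and 12 (space symbol) are taken from the parse() output shown on this page.

```python
# Illustrative sketch only: a two-class finite-state tokenizer.
# Token ids 1 (word) and 12 (space symbol) mirror the ids seen in the
# parse() output on this page. States and transitions are hypothetical
# and far simpler than those of the real parser.

WORD, BLANK = 1, 12


def char_class(ch):
    """Map a character to an input class of the automaton."""
    # str.isalnum() is Unicode-aware, so non-ASCII letters count as word chars.
    return "alnum" if ch.isalnum() else "other"


# state -> input class -> next state
TRANSITIONS = {
    "start": {"alnum": "word", "other": "blank"},
    "word": {"alnum": "word", "other": "blank"},
    "blank": {"alnum": "word", "other": "blank"},
}
TOKEN_ID = {"word": WORD, "blank": BLANK}


def parse(text):
    """Return a list of (tokid, token) pairs for the given text."""
    tokens = []
    state, begin = "start", 0
    for pos, ch in enumerate(text):
        nxt = TRANSITIONS[state][char_class(ch)]
        # Emit a token when the state changes; in this sketch every
        # non-alphanumeric character is also emitted as its own token.
        if state != "start" and (nxt != state or state == "blank"):
            tokens.append((TOKEN_ID[state], text[begin:pos]))
            begin = pos
        state = nxt
    if state != "start":
        tokens.append((TOKEN_ID[state], text[begin:]))
    return tokens
```

With these assumptions, parse("a_b_c") returns [(1, 'a'), (12, '_'), (1, 'b'), (12, '_'), (1, 'c')], which has the same shape as the 'a_b_c' result listed among the issues on this page.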
A list of current issues in the parser (as available from CVS HEAD):
Multiple consecutive slashes ('////'): broken
test=# select * from parse('~//downloads////qq');
 tokid |   token
-------+------------
    12 | ~
    12 | /
    19 | /downloads
    12 | /
    12 | /
    12 | /
    19 | /qq
(7 rows)
We consider '_' a space symbol
test=# select * from parse('a_b_c');
 tokid | token
-------+-------
     1 | a
    12 | _
     1 | b
    12 | _
     1 | c
XHTML tag: broken (FIXED)
test=# select * from parse('<br/>');
 tokid | token
-------+-------
    12 | <
     1 | br
    12 | />
word…: broken (FIXED)
test=# select * from parse('etc...');
 tokid | token
-------+-------
    19 | etc..
    12 | .
~ in path: broken (FIXED)
test=# select * from parse('~/downloads/Harry_Potter.avi');
 tokid |            token
-------+-----------------------------
    12 | ~
    19 | /downloads/Harry_Potter.avi
version: broken (FIXED)
test=# select * from parse('-1.2.3');
 tokid | token
-------+-------
    20 | -1.2
    12 | .
    22 | 3
but see below:
test=# select * from parse('version-1.2.3');
 tokid |     token
-------+---------------
    15 | version-1.2.3
    11 | version
    12 | -
     8 | 1.2.3
Backslash (\) handling: broken (BRR)
select * from parse('a \ b ');
 tokid | token
-------+-------
     1 | a
    12 |
     1 | b
    12 |