Status: Committed to CVS HEAD
Funded by Georgia Public Library Service and LibLime, Inc.
A thesaurus is a collection of words that includes information about the relationships between words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms (PT), non-preferred terms (NPT), related terms, etc.
Basically, a thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally, preserves the non-preferred terms for indexing as well. Preserving non-preferred terms (NPT) makes it possible to use the relationships (BT, NT) at query time. The thesaurus is used during indexing, so any change to the thesaurus requires reindexing. (Do not confuse this with query rewriting, which is applied at the query stage and whose rules can be changed online without reindexing.)
Tsearch2's thesaurus dictionary (TZ) is an extension of the synonym dictionary that adds support for phrases. We were able to introduce a new dictionary type while preserving compatibility with the old interface. Technically, TZ maintains its state and interacts with the parser. TZ uses a subdictionary (which should be defined in the tsearch2 configuration) to normalize the thesaurus text. Only one subdictionary can be defined. Note that the subdictionary produces an error if it cannot recognize a word. In that case, you should remove the definition line containing this word or teach the subdictionary to know it. Some dictionaries recognize every word ('simple', stemmers).
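The normalization step described above can be sketched in Python. This is only a toy model under stated assumptions, not tsearch2 code: KNOWN_WORDS, strict_subdict, simple_subdict and normalize_rule are hypothetical names, and a subdictionary is reduced to a function that returns None for a word it does not recognize.

```python
# Toy model of normalizing a thesaurus rule through a subdictionary.
# All names are hypothetical; this is not tsearch2's implementation.

KNOWN_WORDS = {"stars": "star", "star": "star", "supernovae": "supernovae"}


def strict_subdict(word):
    """Return the normalized lexeme, or None if the word is not recognized."""
    return KNOWN_WORDS.get(word.lower())


def simple_subdict(word):
    """Like a 'simple' dictionary: recognizes every word, only lowercases it."""
    return word.lower()


def normalize_rule(sample_words, subdict):
    """Normalize the sample phrase of one rule; fail on an unrecognized word."""
    normalized = []
    for word in sample_words:
        lexeme = subdict(word)
        if lexeme is None:
            raise ValueError(f"subdictionary does not recognize {word!r}")
        normalized.append(lexeme)
    return normalized
```

With the strict subdictionary, a rule containing an unknown word raises an error, which mirrors the advice to remove that line or extend the subdictionary. With the permissive one, every rule is accepted.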
Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e., only the position of the word is stored, while the exact stop-word is not important. To break possible ties, the thesaurus applies the last definition. For example, consider thesaurus rules (with the simple subdictionary) that share the pattern 'swsw' ('s' designates a stop-word and 'w' a known word):
a one the two : swsw
the one a two : swsw2
Words 'a' and 'the' are stop-words defined in the configuration of the subdictionary. The thesaurus considers the texts 'the one the two' and 'that one then two' to be equal and will use the definition 'swsw2'.
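The placeholder and tie-breaking behaviour can be illustrated with a small Python sketch. It is a simplified model, not the tsearch2 implementation: the function names are hypothetical, and the stop-word list is an assumption (it includes 'that' and 'then' so that the second example text matches).

```python
# Toy model: stop-words in a rule are replaced by a placeholder, so only
# their position matters. Names and the stop-word list are assumptions.

STOP_WORDS = {"a", "the", "that", "then"}  # assumed subdictionary stop-words
PLACEHOLDER = None  # stands for "any stop-word at this position"


def to_pattern(words):
    """Map each word to itself, or to the placeholder if it is a stop-word."""
    return tuple(PLACEHOLDER if w in STOP_WORDS else w for w in words)


def compile_rules(rules):
    """Build pattern -> substitute; a later rule overwrites an earlier one."""
    table = {}
    for sample, substitute in rules:
        table[to_pattern(sample.split())] = substitute
    return table


RULES = [("a one the two", "swsw"), ("the one a two", "swsw2")]
TABLE = compile_rules(RULES)


def lookup(text):
    """Return the substitute for a whole phrase, or None if no rule matches."""
    return TABLE.get(to_pattern(text.split()))
```

Both rules collapse to the same pattern, so the table holds a single entry and the later definition is the one that survives.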
Like a normal dictionary, it should be assigned to specific lexeme types. Since TZ can recognize phrases, it must remember its state and interact with the parser. TZ uses these assignments to decide whether it should handle the next word or stop accumulating. The compiler of a TZ should take care to configure it properly to avoid confusion. For example, if TZ is assigned to handle only the lword lexeme type, then a TZ definition like 'one 1 : 11' will not work, since the lexeme type digit is not assigned to the TZ.
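A rough Python sketch of why such a rule fails follows. It is an illustration under assumptions, not tsearch2's parser interaction: tokens are modelled as (word, type) pairs, the type names follow the wording above, and match_at_start is a hypothetical helper.

```python
# Toy model of phrase accumulation limited by assigned token types.
# Token types and function names are illustrative assumptions.

RULES = {("one", "1"): "11", ("one",): "1"}


def match_at_start(tokens, assigned_types):
    """Return the substitute for the longest rule matching at the start.

    tokens is a list of (word, token_type) pairs. Accumulation stops at the
    first token whose type is not assigned to the thesaurus.
    """
    phrase = []
    best = None
    for word, token_type in tokens:
        if token_type not in assigned_types:
            break  # the thesaurus never sees this token
        phrase.append(word)
        if tuple(phrase) in RULES:
            best = RULES[tuple(phrase)]
    return best


TOKENS = [("one", "lword"), ("1", "digit")]
```

When only lword is assigned, accumulation stops before '1' and the two-word rule can never fire. Once digit is assigned as well, the longer rule matches.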
A thesaurus is a plain text file of the following format:
# this is a comment
sample word(s) : indexed word(s)
...............................
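A minimal Python sketch of reading this format is given below. It only illustrates the layout. The function name is hypothetical, and the handling of blank lines, of '#' after leading whitespace, and of malformed lines is an assumption rather than documented tsearch2 behaviour.

```python
# Minimal parser for the thesaurus file format described above.
# The function name and the error handling are illustrative assumptions.


def parse_thesaurus(text):
    """Return a list of (sample_words, indexed_words) pairs."""
    rules = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (assumed behaviour)
        sample, sep, indexed = line.partition(":")
        if not sep or not sample.split() or not indexed.split():
            raise ValueError(f"line {line_no}: expected 'sample : indexed'")
        rules.append((sample.split(), indexed.split()))
    return rules


EXAMPLE = """\
# this is a comment
one two : 12
supernovae stars:supernovae stars SN
"""
```

Calling parse_thesaurus(EXAMPLE) yields one pair per rule line, with the comment line skipped.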
tsearch2 comes with a thesaurus template, which can be used to define a new dictionary:
INSERT INTO pg_ts_dict
    (SELECT 'tz_simple', dict_init,
            'DictFile="/path/to/tz_simple.txt",'
            'Dictionary="en_stem"',
            dict_lexize
     FROM pg_ts_dict
     WHERE dict_name = 'thesaurus_template');
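Here:
tz_simple - is the dictionary name
DictFile="/path/to/tz_simple.txt" - is the location of the thesaurus file
Dictionary="en_stem" - defines the subdictionary to use for thesaurus normalization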
Now it is possible to use tz_simple in pg_ts_cfgmap, for example:
update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}'
where ts_name = 'default_russian'
  and tok_alias in ('lhword', 'lword', 'lpart_hword');
tz_simple:
one : 1
two : 2
one two : 12
the one : 1
one 1 : 11
To see how the thesaurus works, one can use the to_tsvector, to_tsquery or plainto_tsquery functions:
=# select plainto_tsquery('default_russian',' one day is oneday');
    plainto_tsquery
------------------------
 '1' & 'day' & 'oneday'

=# select plainto_tsquery('default_russian','one two day is oneday');
     plainto_tsquery
-------------------------
 '12' & 'day' & 'oneday'

=# select plainto_tsquery('default_russian','the one');
NOTICE:  Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
 plainto_tsquery
-----------------
 '1'
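The substitutions seen in this session can be imitated with a toy greedy longest-match routine in Python. This is a sketch under assumptions, not tsearch2's algorithm: the stop-word list, the use of None as the stop-word placeholder, and the fallback that simply drops stop-words (standing in for the next dictionary in the list) are all hypothetical simplifications.

```python
# Toy greedy longest-match substitution over a token list. It imitates the
# session above but is not tsearch2's algorithm; names are assumptions.

STOP_WORDS = {"the", "is"}  # assumed stop-words of the subdictionary

RULES = {
    ("one",): ["1"],
    ("two",): ["2"],
    ("one", "two"): ["12"],
    (None, "one"): ["1"],  # 'the one': None marks "any stop-word"
    ("one", "1"): ["11"],
}
MAX_LEN = max(len(pattern) for pattern in RULES)


def substitute(text):
    """Replace the longest matching phrase at each position."""
    words = text.lower().split()
    keys = [None if w in STOP_WORDS else w for w in words]
    out = []
    i = 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            pattern = tuple(keys[i:i + n])
            if pattern in RULES:
                out.extend(RULES[pattern])
                i += n
                break
        else:
            # No rule matched: fall through to the next dictionary, which
            # here just drops stop-words and keeps everything else.
            if keys[i] is not None:
                out.append(words[i])
            i += 1
    return out
```

Trying the longest candidate phrase first is what makes 'one two' produce '12' rather than '1' followed by '2'.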
If you add the NPT to the PT in a TZ rule, the resulting tsvector will contain the NPT as well as the PT, so it can also be used to search for the NPT, not just the PT, by using a different tsearch2 configuration.
supernovae stars : supernovae stars SN
crab nebulae : crab nebulae SN
Searching for 'supernovae stars' or 'crab nebulae' with TZ support will find the same set of documents, since both queries will be converted to 'SN'. At the same time, using a configuration without TZ support, it is possible to use the same tsvector to search for 'supernovae stars' only. This looks cumbersome, but one can use various tools to build a TZ from scratch easily.
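The effect of keeping the NPT next to the PT can be sketched with a tiny in-memory index in Python. This is an illustration only: the names are hypothetical, documents are reduced to a single phrase each, and no stemming is applied.

```python
# Toy illustration of keeping non-preferred terms (NPT) next to the
# preferred term (PT) in the index. Names are hypothetical.

RULES = {
    ("supernovae", "stars"): ["supernovae", "stars", "SN"],
    ("crab", "nebulae"): ["crab", "nebulae", "SN"],
}


def index_terms(phrase):
    """Return the set of lowercased terms stored for a phrase."""
    key = tuple(phrase.lower().split())
    return {t.lower() for t in RULES.get(key, key)}


DOCS = {1: "supernovae stars", 2: "crab nebulae"}
INDEX = {doc_id: index_terms(text) for doc_id, text in DOCS.items()}


def search(terms):
    """Return ids of documents whose index contains every given term."""
    wanted = {t.lower() for t in terms}
    return sorted(d for d, stored in INDEX.items() if wanted <= stored)
```

A query reduced to the PT, search(["SN"]), finds both documents, while a query on the raw words, as a configuration without TZ would issue, finds only the document that contains them.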