tsquery concatenations
tsquery AND tsquery
test=# select 'a&b'::tsquery && 'c|d'::tsquery;
         ?column?
---------------------------
 'a' & 'b' & ( 'c' | 'd' )
tsquery OR tsquery
test=# select 'a&b'::tsquery || 'c|d'::tsquery;
         ?column?
---------------------------
 'a' & 'b' | ( 'c' | 'd' )
To load tsearch2 into a database from a non-default library location, rewrite $libdir in tsearch2.sql before loading it:

sed -e 's|$libdir|/path/to/new/tsearch2/installation/|g' < tsearch2.sql | psql your_DB
select show_curcfg();
 show_curcfg
-------------
     2407575
(1 row)
A human-readable configuration (instead of the oid) can be obtained using:
select * from pg_ts_cfg where oid=show_curcfg();
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default_russian | default  | ru_RU.KOI8-R
(1 row)
# select * from pg_ts_cfg;
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default         | default  | C
 default_russian | default  | ru_RU.KOI8-R
 simple          | default  |
(3 rows)
The server's locale can be found using the SHOW ALL command (see the values of lc_collate and lc_ctype).
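You can also query the two relevant settings directly; the 'C' values below are just an illustration, your server may report a different locale:

test=# show lc_collate;
 lc_collate
------------
 C
(1 row)

test=# show lc_ctype;
 lc_ctype
----------
 C
(1 row)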
The current configuration (valid for the current session only) used by tsearch2 functions can be specified using set_curcfg(pg_ts_cfg.ts_name):
# select set_curcfg('default');
 set_curcfg
------------

(1 row)
Also, you may explicitly specify the configuration name:
test=# select to_tsquery('default','busy&men');
   to_tsquery
----------------
 'busi' & 'men'
(1 row)

test=# select to_tsquery('simple','busy&men');
   to_tsquery
----------------
 'busy' & 'men'
(1 row)
test=# update tt set ti=
           setweight( to_tsvector(title), 'A' )    ||
           setweight( to_tsvector(keyword), 'B' )  ||
           setweight( to_tsvector(abstract), 'C' ) ||
           setweight( to_tsvector(body), 'D' );
Later, you may specify in the ranking function the actual weights to be used to sort search results. The default weights are {0.1, 0.2, 0.4, 1.0} (for D, C, B, A), so the title is the most important part (A) of the document.
Note: you may explicitly specify weights when calling the ranking function:
test=# select title, rank('{0.1,0.2,0.4,1.0}', ti, to_tsquery('supernovae & stars')) as rank
       from apod
       where ti @@ to_tsquery('supernovae & stars')
       order by rank desc;
If you want something else, you can use length(tsvector) as the document length.
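For instance, a sketch only (the normalization scheme is my own choice, the apod table and ti column are the ones used elsewhere in these notes): divide the rank by the number of lexemes to penalize very long documents.

apod=# select title, rank(ti,q)/length(ti) as nrank
       from apod, to_tsquery('supernovae & stars') q
       where ti @@ q
       order by nrank desc limit 10;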
apod=# update apod set ti=setweight( to_tsvector(title), 'A' ) || to_tsvector(body);
apod=# \d apod
        Table "public.apod"
 Column |   Type   | Modifiers
--------+----------+-----------
 id     | integer  |
 title  | text     |
 body   | text     |
 ti     | tsvector |
Indexes: tiidx gist (ti)
To search only in titles I used the select below:
apod=# select title from apod where ti @@ to_tsquery('supernovae:a & stars');
It's possible to specify several parts of documents to be searched. Suppose we have constructed the tsvector from 4 parts - a, b, c, d. To search in parts 'a' and 'b' use:
to_tsquery('supernovae:ab & stars');
1st query:

select id, headline(body,q), rank(ti,q) as rank
from apod, to_tsquery('stars') q
where ti @@ q order by rank desc limit 10;
2nd query with subselect:
select id, headline(body,q), rank
from ( select id, body, q, rank(ti,q) as rank
       from apod, to_tsquery('stars') q
       where ti @@ q order by rank desc limit 10) as foo;
The 2nd query can be an order of magnitude faster than the 1st one! This is because of the slow headline() function, which processes a document (and so needs to read the document from disk), not just an index. The 1st query has to produce as many headlines as there are search results, while the 2nd query needs to read only 10 documents. That makes the difference!
apod=# select set_curcfg('default');
apod=# select * from pg_ts_cfg where oid=show_curcfg();
 ts_name | prs_name | locale
---------+----------+--------
 default | default  | C
apod=# select * from ts_debug('Tsearch module for PostgreSQL 7.3.3');
 ts_name | tok_type | description |   token    | dict_name |   tsvector
---------+----------+-------------+------------+-----------+--------------
 default | lword    | Latin word  | Tsearch    | {en_stem} | 'tsearch'
 default | lword    | Latin word  | module     | {en_stem} | 'modul'
 default | lword    | Latin word  | for        | {en_stem} |
 default | lword    | Latin word  | PostgreSQL | {en_stem} | 'postgresql'
 default | version  | VERSION     | 7.3.3      | {simple}  | '7.3.3'
Here each row shows how a token was classified, which dictionaries were consulted, and the resulting lexeme; notice that the stop word 'for' produced no lexeme.
General rule: for each lexeme type it's possible to specify which dictionaries, and in what order, will be used to process the lexeme. This information is stored in the table pg_ts_cfgmap. For example:
......skip.........................................
 default_russian | lword       | {astrosyn,en_stem}
 default_russian | lpart_hword | {astrosyn,en_stem}
 default_russian | lhword      | {astrosyn,en_stem}
Note that once a lexeme gets recognized by one dictionary it will not be passed to the next dictionary! So, it's meaningless to specify {simple, en_stem} or {en_stem, astrosyn}, because 'simple' and 'en_stem' recognize any lexeme by definition.
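You can check this with lexize() (the output shown is illustrative): 'simple' returns a lexeme for any input, so any dictionary placed after it in the queue would never be consulted.

test=# select lexize('simple','stars');
 lexize
---------
 {stars}
(1 row)

test=# select lexize('en_stem','stars');
 lexize
--------
 {star}
(1 row)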
A synonym dictionary is a file containing 'word synonym' pairs. For example, I want to change the words 'supernovae', 'supernova', 'sne' to 'sn'. I created the file 'astrosyn':
supernovae sn
supernova  sn
sne        sn
Then, I registered the synonym dictionary using an existing template:
apod=# insert into pg_ts_dict
       (select 'astrosyn', dict_init, 'path', dict_lexize, 'AstroSyn'
        from pg_ts_dict where dict_name='synonym');
and specified the dictionaries to be used for English words (I use the 'default_russian' configuration):
apod=# update pg_ts_cfgmap set dict_name='{astrosyn, en_stem}'
       where ts_name='default_russian'
         and tok_alias in ('lword', 'lpart_hword', 'lhword');
English words will first be passed to the 'astrosyn' dictionary and then, only if not recognized there, to the 'en_stem' dictionary.
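As a quick sanity check (output assumes the 'astrosyn' file shown above), lexize() can be called on each dictionary separately; the synonym dictionary returns nothing for a word that is not in its file, which is what lets the word fall through to 'en_stem':

apod=# select lexize('astrosyn','supernovae'), lexize('astrosyn','stars');
 lexize | lexize
--------+--------
 {sn}   |
(1 row)

apod=# select lexize('en_stem','stars');
 lexize
--------
 {star}
(1 row)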
That's all:
apod=# select to_tsvector('default_russian','supernovae stars');
   to_tsvector
-----------------
 'sn':1 'star':2
(1 row)
apod=# \d apod
        Table "public.apod"
 Column |   Type   | Modifiers
--------+----------+-----------
 id     | integer  |
 title  | text     |
 body   | text     |
 ti     | tsvector |
Indexes: tiidx gist (ti)

apod=# select count(*) from apod;
 count
-------
  4227
(1 row)
Top 10 most frequent words:
apod=# select * from stat('select ti from apod')
       order by ndoc desc, nentry desc, word limit 10;
  word  | ndoc | nentry
--------+------+--------
 year   | 1283 |   2064
 star   | 1273 |   3908
 imag   | 1267 |   1967
 light  | 1232 |   2206
 pictur | 1177 |   1442
 earth  | 1059 |   1798
 visibl |  992 |   1259
 bright |  936 |   1335
 котор  |  903 |   1441
 эт     |  879 |   1397
(10 rows)
The word 'year' appears 2064 times in 1283 documents. This select is quite slow, so you may want to save the result into a separate table using SELECT INTO for further exploration.
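For example (a sketch; the table name apod_stat is arbitrary):

apod=# select * into apod_stat from stat('select ti from apod');
apod=# select * from apod_stat order by ndoc desc, nentry desc, word limit 10;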
It's possible to obtain word statistics on specific parts of documents by passing a second parameter to the 'stat' function. The example above, but for words in titles:
apod=# select * from stat('select ti from apod','a')
       order by ndoc desc, nentry desc, word limit 10;
To get statistics on several parts, use concatenation ('ad'):
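For example, to gather statistics over both titles ('a') and bodies ('d') at once:

apod=# select * from stat('select ti from apod','ad')
       order by ndoc desc, nentry desc, word limit 10;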
select * from token_type();
 tokid |    alias     |               descr
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | http         | HTTP head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity
(23 rows)
Hi guys, I'm CCing Andrew, because he is working on improving the tsearch2 documentation and I think he would be interested in some details.

Marcus, I agree the docs on stop words are a bit vague, but you can look at the default configuration (see tsearch2.sql) - it already specifies stop-word lists for the English and Russian languages:

insert into pg_ts_dict
       select 'en_stem', 'snb_en_init(text)',
              '/usr/local/pgsql/share/contrib/english.stop',
              'snb_lexize(internal,internal,int4)',
              'English Stemmer. Snowball.';

Andrew, could you please mention this in the docs.

Now, I'm going to describe how dictionaries treat stop words - they do it in *different* ways and this is a source of confusion:

ispell  - normalizes the word and then looks up the normalized form in the stop-word file;
stemmer - looks up the word in the stop-word file and then does its job.

I think you have enough information to decide what you have to do :) Andrew, this should be described in detail with examples: a) only stemmer b) only ispell c) ispell, stemmer
In principle, you have to use different stop files for different dictionaries. Below is a simple shell script which shows how to derive stop words for the English stemmer from ispell's stop file. Hint: the database SOMEDB should have the en_stem_no_stopwords dictionary installed.
insert into pg_ts_dict
       (SELECT 'en_stem_no_stopwords', dict_init, '', dict_lexize,
               'english stemmer without stop words'
        FROM pg_ts_dict
        WHERE dict_name = 'en_stem');
cat stop-words-file-for-ispell |\
awk " { print \"select lexize('en_stem_no_stopwords','\"\$1\"');\" }" |\
psql SOMEDB -P 'tuples_only' |\
grep -v '^$' |\
tr -d ' {}' > stop-words-file-for-stemmer
From my posting to -hackers list:
Hmm, the default configuration is too eager - you index every lexeme using the simple dictionary! Probably, it's too much. Here is what I have for my Russian configuration in the dictionary database:
 default_russian | lword        | {en_ispell,en_stem}
 default_russian | lpart_hword  | {en_ispell,en_stem}
 default_russian | lhword       | {en_ispell,en_stem}
 default_russian | nlword       | {ru_ispell,ru_stem}
 default_russian | nlpart_hword | {ru_ispell,ru_stem}
 default_russian | nlhword      | {ru_ispell,ru_stem}
Notice, I index only Russian and English words - no numbers, URLs, etc. You may simply delete the unwanted rows in pg_ts_cfgmap for your configuration, but I'd recommend updating them instead, setting dict_name to NULL. For example, to avoid indexing integers:
voc=# update pg_ts_cfgmap set dict_name=NULL
      where ts_name='default_russian' and tok_alias='int';
voc=# select token,dict_name,tok_type,tsvector
      from ts_debug('Do you have +7000 bucks');
 token |      dict_name      | tok_type | tsvector
-------+---------------------+----------+----------
 Do    | {en_ispell,en_stem} | lword    |
 you   | {en_ispell,en_stem} | lword    |
 have  | {en_ispell,en_stem} | lword    |
 +7000 |                     | int      |
 bucks | {en_ispell,en_stem} | lword    | 'buck'
Only 'bucks' gets indexed :)
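The same trick works for other unwanted token types. For example (a sketch; adjust the list of tok_alias values, taken from token_type() above, to your needs), to stop indexing URLs, hosts and file paths in the same configuration:

voc=# update pg_ts_cfgmap set dict_name=NULL
      where ts_name='default_russian'
        and tok_alias in ('url','host','uri','file');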
Since the built-in ranking functions don't use global information, it's very easy to scale full-text search: distribute the database over several computers, use dblink to execute the selects, union the search results and order them by rank.
=# select dblink_connect('apod','dbname=apod');
=# select * from (
     select * from dblink('apod',
       'select title, rank_cd(fts,q) as rank from apod, to_tsquery(''crab'') q
        where fts @@ q order by rank desc limit 10'
     ) as t1(title text, rank real)
     union all
     select title, rank_cd(fts,q) as rank from apod, to_tsquery('crab') q
     where fts @@ q order by rank desc limit 10
   ) as foo order by rank desc limit 10;
=# select * from gist_print('a_gist_key') as t(level int, a gtsvector)
   order by level asc;
 level |                a
-------+---------------------------------
     1 | 2016 true bits, 0 false bits
     1 | 2016 true bits, 0 false bits
     1 | 2016 true bits, 0 false bits
     2 | 2009 true bits, 7 false bits
     2 | 2004 true bits, 12 false bits
     2 | 2010 true bits, 6 false bits
     2 | 2003 true bits, 13 false bits
     2 | 2008 true bits, 8 false bits
     2 | 2003 true bits, 13 false bits
     2 | 2005 true bits, 11 false bits
     2 | 2005 true bits, 11 false bits
...................................
     4 | 23 unique words
     4 | 41 unique words
     4 | 1092 true bits, 924 false bits
     4 | 1003 true bits, 1013 false bits
     4 | 982 true bits, 1034 false bits
     4 | 57 unique words
     4 | 62 unique words
     4 | 537 true bits, 1479 false bits
This is an example of a bad tree: there are many very long documents (books), so many signatures are degenerate and not useful. The root's signatures are degenerate and the signatures of the 2nd level are also not very good; the 4th level is the leaf level. Documents are represented as bit signatures, but sometimes we use a space optimization and store the words themselves, like '23 unique words'.
To do phrase searching just add an additional WHERE clause to your query:

SELECT id FROM tab
WHERE ts_idx_col @@ to_tsquery('history&lesson')
  AND text_col ~* '.*history\\s+lesson.*';

The full-text index will still be used, and the regex will be used to prune the results afterwards.
A short note on using dictionaries with tsearch2.
The purpose of any dictionary in a search engine is to help people search for words without bothering about different forms (declension, inflection, ...). Dictionaries can be used to process queries as well as during indexing. You may store the original form of a word and/or its stem. The most complete index stores both variants and can provide exact search, but at the cost of index size. Historically, since tsearch2 was based on GiST storage, which is quite sensitive to the number of unique words, we store only stems. This might change in the future, since we can now use an inverted index with tsearch2.
An ISpell dictionary is an (open-source) way to find a word's stem(s); their quality differs greatly between languages. We use a Russian ispell dictionary and find it rather useful. Of course, since a real language is much more complex than ispell rules, there are errors, which produce "noise" in search results. An ispell dictionary can return several normal forms for one word; for example, 'booking' has two infinitives - booking and book.
Ispell dictionaries support many ways of word building, but are difficult to build and maintain. That's why various stemming algorithms have become popular; read http://snowball.tartarus.org/texts/introduction.html for a good introduction. We chose the Snowball stemmer since it's open-source and written/supported by the well-known Martin Porter.
For each lexeme class there is a configurable dictionary queue (in pg_ts_cfgmap). A lexeme passes through this queue until it is recognized by some dictionary (currently, there is no way to recognize a lexeme and still pass it to the next dictionary). It's advisable to begin with a very specific dictionary (topic-related, synonym) and finish the queue with the most common dictionary, like 'simple' or a stemmer, which recognizes everything :)
The specific configuration depends very much on the language, the availability of good dictionaries and the goals of the search engine. The Snowball stemmer works well for English, since word formation is mostly suffix-oriented (I might be wrong here!), so lacking a good ispell dictionary, one could use just the Snowball stemmer. On the other hand, for Russian we have a good ispell dictionary, which is actively developed and supported, and Russian word building is quite complex, so we definitely recommend using the ispell dictionary before the Snowball stemmer.
It's quite difficult to index a mix of several languages which share common characters, since there is no way to recognize the language. I'd definitely warn you against using a stemmer anywhere except at the very end of the queue, since it recognizes everything and no dictionaries after it will ever be used. Hopefully, any useful text has only one main language. If, for example, the main language is French and the second one is English, I'd use French ispell, English ispell, French stemmer.
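As a sketch only (the dictionary names fr_ispell, en_ispell and fr_stem and the configuration name default_french are hypothetical and depend on what you have registered in pg_ts_dict and pg_ts_cfg; also check with ts_debug which token types your French words actually map to under your locale), such a French-first queue could be configured like this:

=# update pg_ts_cfgmap set dict_name='{fr_ispell,en_ispell,fr_stem}'
   where ts_name='default_french'
     and tok_alias in ('lword','lpart_hword','lhword');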