Find documents that satisfy a query and optionally return them in some order. The most common case: find documents containing all query terms and return them in order of their similarity to the query.
A document can be:
A document must be identified by some unique key.
Text search operators have existed for years.
The idea is rather simple: preprocess the document at index time to save time at the search stage.
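In tsearch2 terms, this usually means keeping a precomputed tsvector column next to the document; a minimal sketch, with hypothetical table and column names (the functions used are introduced below):

CREATE TABLE messages (
    id   serial PRIMARY KEY,  -- the unique key identifying a document
    body text,                -- the document itself
    fts  tsvector             -- preprocessed form, computed at index time
);
-- fill the preprocessed column once, instead of parsing at every search
UPDATE messages SET fts = to_tsvector(body);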
Preprocessing:
Tsearch2 is the full-text search engine for PostgreSQL, fully integrated into the database. Main features (new in 8.2 are bolded):
Tsearch2, in a nutshell, provides an FTS operator (contains) for two new data types, which represent a document and a query.
We introduced two new data types and an operator for FTS:
tsvector @@ tsquery
tsquery @@ tsvector
The @@ operator returns TRUE if the tsvector contains the tsquery.
=# select 'cat & rat'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 t

=# select 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
=# select 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
Notice that a space is also a lexeme!
=# select 'space '' '' is a lexeme'::tsvector;
            tsvector
----------------------------------
 'a' 'is' ' ' 'space' 'lexeme'
=# select 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                    tsvector
-------------------------------------------------------------------------------
 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
=# select 'fat:1 cat:2'::tsvector || 'fat:1 rat:2'::tsvector;
         ?column?
---------------------------
 'cat':2 'fat':1,3 'rat':4

=# select 'fat:1 rat:2'::tsvector || 'fat:1 cat:2'::tsvector;
         ?column?
---------------------------
 'cat':4 'fat':1,3 'rat':2
(1 row)
=# select 'fat & cat'::tsquery;
    tsquery
---------------
 'fat' & 'cat'

=# select 'fat:ab & cat'::tsquery;
     tsquery
------------------
 'fat':AB & 'cat'
tsqueries can be combined:
tsquery && (ANDed) tsquery
test=# select 'a&b'::tsquery && 'c|d'::tsquery;
         ?column?
---------------------------
 'a' & 'b' & ( 'c' | 'd' )
tsquery || (ORed) tsquery
test=# select 'a&b'::tsquery || 'c|d'::tsquery;
         ?column?
---------------------------
 'a' & 'b' | ( 'c' | 'd' )
Document → to_tsvector → tsvector
=# select to_tsvector('a fat cat sat on a mat - it ate a fat rats');
                     to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
=# select setweight( to_tsvector('story title'), 'A') ||
          setweight( to_tsvector('story abstract'), 'B') ||
          setweight( to_tsvector('story body'), 'D');
                     ?column?
--------------------------------------------------
 'bodi':6 'titl':2A 'stori':1A,3B,5 'abstract':4B
=# select strip( '''bodi'':6 ''titl'':2A ''stori'':1A,3B,5 ''abstract'':4B');
              strip
----------------------------------
 'bodi' 'titl' 'stori' 'abstract'
CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;

CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(tsvector, dropatsymbol, strMessage);
We want to control the document → tsvector conversion and be able to do that in various ways. That means we should be able to specify how to parse a document, which lexemes to index, and how to process them.
Tsearch2 provides flexible, table-driven configuration support.
Each configuration has a unique name (ts_name), a parser (prs_name), and a locale name (the server locale; see SHOW ALL), which is used to identify the default configuration.
=# select * from pg_ts_cfg;
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default         | default  | C
 default_russian | default  | ru_RU.KOI8-R
 utf8_russian    | default  | ru_RU.UTF-8
 simple          | default  |
=# select show_curcfg();
 show_curcfg
-------------
       18037
In friendlier form:
=# select * from pg_ts_cfg where oid=show_curcfg();
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default_russian | default  | ru_RU.KOI8-R
A dictionary is a program which accepts a lexeme (or lexemes) on input and returns:
WARNING: Data files used by dictionaries should be in the server_encoding to avoid possible problems!
Usually, dictionaries are used for normalization of words ('rats' → 'rat').
pg_ts_dict is the dictionary registry. Tsearch2 provides templates for several dictionaries to simplify registration of new dictionaries.
=# \d pg_ts_dict
         Table "public.pg_ts_dict"
     Column      |     Type     | Modifiers
-----------------+--------------+-----------
 dict_name       | text         | not null
 dict_init       | regprocedure |
 dict_initoption | text         |
 dict_lexize     | regprocedure | not null
 dict_comment    | text         |
Indexes:
    "pg_ts_dict_pkey" PRIMARY KEY, btree (dict_name)
Templates:
Define a new dictionary 'en_ispell', which uses the ispell template and keeps its data in the '/usr/local/share/dicts/ispell/' directory:
INSERT INTO pg_ts_dict
    (SELECT 'en_ispell',
            dict_init,
            'DictFile="/usr/local/share/dicts/ispell/english.dict",'
            'AffFile="/usr/local/share/dicts/ispell/english.aff",'
            'StopFile="/usr/local/share/dicts/english.stop"',
            dict_lexize
       FROM pg_ts_dict
      WHERE dict_name = 'ispell_template');
Ispell and snowball stemmers treat stop words differently:
=# select lexize('en_stem','rats');
 lexize
--------
 {rat}
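To see the difference, you can query both dictionaries directly with lexize (assuming the en_ispell dictionary registered above; actual results depend on your stop-word files):

select lexize('en_ispell', 'a');  -- ispell consults its StopFile
select lexize('en_stem', 'a');    -- the stemmer consults its own stop-word list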
Dictionaries with predefined templates (synonym, thesaurus, ispell, snowball stemmer) need only data.
You may also develop a dictionary from scratch. The gendict utility, which comes with tsearch2, generates a template file for the dictionary program. Gendict has built-in support for snowball stemmers.
Table "public.pg_ts_parser" Column | Type | Modifiers ---------------+--------------+----------- prs_name | text | not null prs_start | regprocedure | not null prs_nexttoken | regprocedure | not null prs_end | regprocedure | not null prs_headline | regprocedure | not null prs_lextype | regprocedure | not null prs_comment | text | Indexes: "pg_ts_parser_pkey" PRIMARY KEY, btree (prs_name)
Tsearch2 comes with a default parser which recognizes the following 23 token types:
=# select * from token_type();
 tokid |    alias     |               descr
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity
(23 rows)
Notice that the space symbol is also a token:
=# select * from parse('<b>Fat</b> cat');
 tokid | token
-------+-------
    13 | <b>
     1 | Fat
    13 | </b>
    12 |
     1 | cat
(5 rows)
The default parser can be set using the function set_curprs(parser_name).
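For example, selecting the default parser explicitly (a trivial usage sketch):

select set_curprs('default');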
For each lexeme class (token_type) there is a configurable dictionary queue (in pg_ts_cfgmap). A lexeme passes through this queue until it is recognized by some dictionary (currently, there is no way for a dictionary to recognize a lexeme and still pass it on to the next one). It is sensible to begin with a very specific dictionary (topic-related, synonym) and finish the queue with the most common dictionary, like 'simple' or a stemmer, which recognizes everything :)
An excerpt from pg_ts_cfgmap:
     ts_name     |  tok_alias   |        dict_name
-----------------+--------------+--------------------------
 ...
 default_russian | lword        | {en_ispell,en_stem}
 ...
 default_russian | nlword       | {ru_ispell,ru_stem_koi8}
If a lexeme type is not specified in pg_ts_cfgmap, or its dict_name is NULL, the lexeme will not be indexed. For example, suppose we don't want to index 'uri' in the default_russian configuration: we may delete the corresponding row, but it is better to leave it and set dict_name to NULL:
update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian' and tok_alias='uri';
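To check the change, look at the mapping row itself; the dict_name for 'uri' should now be NULL:

select * from pg_ts_cfgmap where ts_name='default_russian' and tok_alias='uri';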
A tsquery is obtained at search time, using the same configuration table as tsvector.
=# select to_tsquery('fat & rats');
  to_tsquery
---------------
 'fat' & 'rat'
=# select plainto_tsquery('fat rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'
Query rewriting is a set of functions and operators for the tsquery type. It allows control of search at query time without reindexing (in contrast to a thesaurus dictionary).
Query rewriting is flexible.
Replace 'new york' with its synonyms:
=# select rewrite('foo & bar & qq & new & york',
                  'new & york'::tsquery,
                  'big & apple | nyc | new & york & city');
                                     rewrite
----------------------------------------------------------------------------------
 'qq' & 'foo' & 'bar' & ( 'city' & 'york' & 'new' | ( 'nyc' | 'apple' & 'big' ) )
(1 row)
Example with synonyms from a table:
=# select rewrite('moscow & hotel',
                  'select keyword, sample from test_tsquery'::text);
              rewrite
-----------------------------------
 ( 'moskva' | 'moscow' ) & 'hotel'
Operators can be used to speed up query rewriting, for example, by filtering out non-candidate tuples when searching a big lookup table.
contrib_regression=# select rewrite( ARRAY[query, keyword, sample] )
                       from test_tsquery, to_tsquery('default', 'moscow') as query
                      where keyword ~ query;
       rewrite
---------------------
 'moskva' | 'moscow'
(1 row)
It is possible to create an index to speed up the @ and ~ operators:
create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
Using standard SQL commands it is possible (inside a transaction) to create your own configurations. The configuration name (ts_name) can be specified explicitly in the functions to_tsvector, to_tsquery, and plainto_tsquery, which provides great flexibility for developing advanced search applications.
The default configuration (selected by the server's locale) knows stop words:
=# select to_tsvector('a fat cat');
   to_tsvector
-----------------
 'cat':3 'fat':2
But the simple configuration accepts all words, without bothering about stop words:
=# select to_tsvector('simple','a fat cat');
      to_tsvector
-----------------------
 'a':1 'cat':3 'fat':2
To materialize search results you need to return documents in some order and display them, to help the user decide which document he really wants. Ordering is an application-specific function and, in principle, it doesn't belong to the tsearch2 core.
Ranking attempts to measure how relevant documents are to a particular query. The concept of relevancy is vague, since it is very application-dependent. Tsearch2 provides two ranking functions, rank and rank_cd, which use different algorithms, but both attempt to take into account lexical (frequency), proximity, and structural information. Optionally, the ranking functions can normalize the rank, which specifies whether a document's length should impact its rank.
Here, '{0.1, 0.2, 0.4, 1.0}' is an array of weights for the 'D', 'C', 'B', 'A' zones stored in the tsvector.
=# select get_covers( to_tsvector('a fat cat sat on a mat - it ate a fat rats'),
                      to_tsquery('fat & cat'));
                get_covers
------------------------------------------
 {1 fat {2 cat }1 sat mat ate fat }2 rat
Notice that the ranking functions use only local information (for the given document), since it is very expensive to obtain global statistics. That means we can't do true normalization of the rank, for example 0 < rank < 1, but you can always cheat a bit and use a transformation like rank/(rank+1).
=# select title, rank_cd('{0.1, 0.2, 0.4, 1.0}', fts, query) as rnk
     from apod, to_tsquery('neutrino|(dark & matter)') query
    where query @@ fts
    order by rnk desc limit 10;
                     title                     |   rnk
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
Now, normalize the rank:
=# select title, rank_cd('{0.1, 0.2, 0.4, 1.0}', fts, query) /
          (rank_cd('{0.1, 0.2, 0.4, 1.0}', fts, query) + 1) as rnk
     from apod, to_tsquery('neutrino|(dark & matter)') query
    where query @@ fts
    order by rnk desc limit 10;
                     title                     |        rnk
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
A headline is a fragment of a document containing the query terms. Headlines are useful for visual representation of search results. Usually, the query terms are bolded.
=# select headline('a fat cat sat on a mat and ate a fat rat', 'fat & cat'::tsquery);
                           headline
---------------------------------------------------------------
 a <b>fat</b> <b>cat</b> sat on a mat and ate a <b>fat</b> rat

=# select headline('a fat cat sat on a mat and ate a fat rat',
                   'fat & cat'::tsquery,
                   'StartSel='', StopSel=''');
                    headline
------------------------------------------------
 a 'fat' 'cat' sat on a mat and ate a 'fat' rat
There are several options which define how to highlight query terms, the minimum and maximum length of the headline, etc.
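For instance, a sketch bounding the headline length with the MinWords/MaxWords options (values chosen for illustration):

select headline('a fat cat sat on a mat and ate a fat rat',
                'fat & cat'::tsquery,
                'MaxWords=7, MinWords=3');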
select id, headline(body,q), rank(ti,q) as rank
  from apod, to_tsquery('stars') q
 where ti @@ q
 order by rank desc limit 10;
select id, headline(body,q), rank
  from ( select id, body, q, rank(ti,q) as rank
           from apod, to_tsquery('stars') q
          where ti @@ q
          order by rank desc limit 10) as foo;
The 2nd query can be an order of magnitude faster than the 1st one! This is because of the slow headline() function, which processes a document (and so needs to read the document from disk), not just an index. The 1st query has to produce as many headlines as there are search results, while the 2nd query needs to read only 10 documents. That makes the difference!
=# select * from ts_debug('a fat rats');
     ts_name     | tok_type | description | token | dict_name | tsvector
-----------------+----------+-------------+-------+-----------+----------
 default_russian | lword    | Latin word  | a     | {en_stem} |
 default_russian | lword    | Latin word  | fat   | {en_stem} | 'fat'
 default_russian | lword    | Latin word  | rats  | {en_stem} | 'rat'
(3 rows)
=# select * from stat('select fts_index from a_pages')
    order by ndoc desc, nentry desc, word limit 10;
    word    | ndoc | nentry
------------+------+--------
 postgresql |  655 |   6127
 document   |  655 |   3486
            |  655 |   2876
 manual     |  655 |   2046
 8.1        |  655 |   2034
 rarr       |  655 |   1965
 text       |  655 |   1424
 home       |  655 |   1315
 copi       |  655 |   1138
 group      |  655 |   1076
Tsearch2 supports indexed access to tsvector in order to further speed up FTS. Note that indexes are not mandatory for FTS! The indexes support concurrency and recovery.
=# create index fts_idx on apod using gist(fts);
=# create index fts_idx on apod using gin(fts);
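Either index then serves ordinary containment searches; for example, against the apod table used above:

select title from apod where fts @@ to_tsquery('dark & matter');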
Generalized means that the index does not know which operations it accelerates; it works with custom strategies defined for specific data types. In this respect GIN is similar to GiST and differs from btree indices, which have predefined, comparison-based operations.
An inverted index is an index structure storing a set of (key, posting list) pairs, where a 'posting list' is the set of document ids in which the key occurs.
Use table inheritance with constraint exclusion (CE) to get fast online FTS: a live table with a GiST index, plus archive tables (mostly read-only) with a GIN index.
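A minimal sketch of that scheme, with hypothetical table names; constraint exclusion additionally needs CHECK constraints on a partitioning column (here, a date) so the planner can skip irrelevant children:

CREATE TABLE docs (id integer, added date, fts tsvector);

-- live part: small, frequently updated -> GiST (concurrency-friendly updates)
CREATE TABLE docs_live (CHECK (added >= '2006-01-01')) INHERITS (docs);
CREATE INDEX docs_live_fts ON docs_live USING gist (fts);

-- archive part: mostly read-only -> GIN (faster search, slower update)
CREATE TABLE docs_arch (CHECK (added < '2006-01-01')) INHERITS (docs);
CREATE INDEX docs_arch_fts ON docs_arch USING gin (fts);

-- searches against the parent see both parts
select id from docs where fts @@ to_tsquery('fat & cat');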