Someone asked us about how to find documents, which contain emails. One possible solution is to write function
CREATE OR REPLACE FUNCTION document_token_types(text) RETURNS _text AS $$ SELECT ARRAY ( SELECT DISTINCT alias FROM ts_token_type('default') AS tt, ts_parse('default', $1) AS tp WHERE tt.tokid = tp.tokid ); $$ LANGUAGE SQL immutable; arxiv=# select document_token_types(title) from papers limit 10; document_token_types --------------------------------------------------------------- {asciihword,asciiword,blank,hword_asciipart} {asciiword,blank} {asciiword,blank} {asciiword,blank} {asciiword,blank} {asciiword,blank,float,host} {asciiword,blank} {asciihword,asciiword,blank,hword_asciipart,int,numword,uint} {asciiword,blank} {asciiword,blank} (10 rows)
Now we can create functional GIN index on document (which can be any combination of text fields) to speedup search.
create index fts_types_idx on papers using gin( document_token_types (comment) );
Find all documents with urls
explain analyze select comment from papers where document_token_types(comment) && '{url}'; Bitmap Heap Scan on papers (cost=40.80..8135.97 rows=2098 width=67) (actual time=6.811..27.679 rows=15483 loops=1) Recheck Cond: (document_token_types(comment) && '{url}'::text[]) -> Bitmap Index Scan on fts_types_idx (cost=0.00..40.28 rows=2098 width=0) (actual time=3.820..3.820 rows=15483 loops=1) Index Cond: (document_token_types(comment) && '{url}'::text[]) Total runtime: 33.124 ms
The list of available token types, supported by parser is available as
arxiv=# select * from ts_token_type('default');