These are the notes on PostgreSQL Full-Text Search engine.
Notice: Read my Tsearch2 83 changes summary and Official Text Search documentation
GiST index has support of fillfactor parameter in create index command, so one could vary size of GiST index. Fillfactor affects on performance of insert - it's always useful to have some additional place on page. For read-only data one could create index with full (100%) utilization of space, so index will be smaller. Default value for fillfactor is 90.
apod=# create index fts_idx on apod using gist(fts) with (fillfactor=100); CREATE INDEX Time: 314.583 ms apod=# select pg_relation_size('fts_idx'); pg_relation_size ------------------ 540672 (1 row) apod=# create index fts_idx on apod using gist(fts) with (fillfactor=50); CREATE INDEX Time: 315.192 ms apod=# select pg_relation_size('fts_idx'); pg_relation_size ------------------ 1138688 (1 row)
GIN index doesn't yet supports fillfactor.
apod=# create index fts_idx on apod using gin(fts); CREATE INDEX Time: 1515.360 ms apod=# select pg_relation_size('fts_idx'); pg_relation_size ------------------ 3547136 (1 row)
It's well-known, that GiST based index for FTS is lossy. Actually, it's not true, since it has optimization for storing small tsvectors (<TOAST_INDEX_TARGET bytes, 512 bytes). On leaf pages small tsvectors stored as is, while longer one are represented by their signatures, which introduce some losiness. Unfortunately, existing index API doesn't allow to say index that it found an exact values (tsvector) or results need to be checked. That's why GiST index is marked as lossy. We hope in future to overcome this issue. We could see this using Gevel module. For experiments we obtain abstracts from arxiv.org, which, we expect, are enough short to be represented as tsvectors on leaf pages.
arxiv=# select gist_stat('gist_idx_90'); gist_stat -------------------------------------------- Number of levels: 4 Number of pages: 18296 Number of leaf pages: 17496 Number of tuples: 435661 Number of invalid tuples: 0 Number of leaf tuples: 417366 Total size of tuples: 124776048 bytes Total size of leaf tuples: 119803816 bytes Total size of index: 149880832 bytes -- leaves arxiv=# select * from gist_print('gist_idx_90') as t(level int,valid bool, fts gtsvector) where level =4; level | valid | fts -------+-------+-------------------------------- 4 | t | 130 true bits, 1886 false bits 4 | t | 95 unique words 4 | t | 33 unique words 4 | t | 77 unique words 4 | t | 68 unique words 4 | t | 86 unique words 4 | t | 77 unique words 4 | t | 51 unique words 4 | t | 122 unique words 4 | t | 127 true bits, 1889 false bits 4 | t | 105 unique words 4 | t | 170 true bits, 1846 false bits 4 | t | 77 unique words 4 | t | 121 true bits, 1895 false bits .................................... 4 | t | 61 unique words (417366 rows) -- internal node arxiv=# select * from gist_print('gist_idx_90') as t(level int, valid bool, fts gtsvector) where level =3; level | valid | fts -------+-------+-------------------------------- 3 | t | 852 true bits, 1164 false bits 3 | t | 861 true bits, 1155 false bits 3 | t | 858 true bits, 1158 false bits 3 | t | 872 true bits, 1144 false bits 3 | t | 858 true bits, 1158 false bits 3 | t | 855 true bits, 1161 false bits 3 | t | 853 true bits, 1163 false bits 3 | t | 857 true bits, 1159 false bits 3 | t | 864 true bits, 1152 false bits 3 | t | 856 true bits, 1160 false bits 3 | t | 848 true bits, 1168 false bits 3 | t | 863 true bits, 1153 false bits 3 | t | 858 true bits, 1158 false bits 3 | t | 859 true bits, 1157 false bits 3 | t | 855 true bits, 1161 false bits 3 | t | 881 true bits, 1135 false bits ............................................... 3 | t | 782 true bits, 1234 false bits 3 | t | 773 true bits, 1243 false bits (17496 rows)
Direct comparison of performance on abstract from e-print archives. Total number of abstracts - 405690. All tests were done on my desktop pc, fully loaded, P4 2.4Ghz, 2Gb RAM, Linux 2.6.19.1, Slackware,PostgreSQL 8.2.4.
postgresql.conf:
Conclusions:
There is niche for both indexes. Most optimal variant is to use GiST index for online data and GiN for archive.
arxiv=# \d papers Table "public.papers" Column | Type | Modifiers -------------------+----------+----------- id | integer | oai_id | text | datestamp | date | title | text | authors | text | description | text | comment | text | creation_date | date | modification_date | date | fts | tsvector | Indexes: "mdidx" btree (modification_date) "pidx" btree (id) arxiv=# select pg_relation_size('papers'); pg_relation_size ------------------ 1054081024 arxiv=# select count(*) from wordstat; count -------- 459841 arxiv=# update papers set fts= setweight( to_tsvector(coalesce(title,'')),'A') || setweight( to_tsvector(coalesce(description,'')),'B') || setweight( to_tsvector(coalesce(authors,'')),'C') || setweight( to_tsvector(coalesce(comment,'')),'D');
Summary:
index creation(ms) size (b) count(*) rank query ------------------------------------------------------------------------- GiN 532310.368 305864704 38.739 130.488 GIST90 176267.543 145989632 111.891 188.992 GIST100 189321.561 130465792 120.730 215.153 GIST50 164669.614 279306240 122.101 200.963
Updating:
index (nlev) 95 1035 10546 ----------------------------------------------------------- GIN 3343.881 36337.733 217577.424 GIST90 (4) 280.072 1835.485 29597.235 GIST50 (5) 238.101 2952.362 33984.443 GIST100 (4) 232.674 2460.621 27852.507
Queries:
-- count(*) arxiv=# select count(*) from papers p, to_tsquery('gamma&ray&burst&!supernovae') q where p.fts @@ q; count ------- 2764 -- ranking query arxiv=# explain analyze select p.title, rank_cd (p.fts, q) as rank from papers p, t o_tsquery('gamma&ray&burst&!supernovae') q where p.fts @@ q order by rank desc limi t 10; -- update query arxiv=# update papers set fts= setweight( to_tsvector(coalesce(title,'')),'A') || ' ' || setweight( to_tsvector(coalesce(description,'')),'B') || ' ' || setweight( to_tsvector(coalesce(authors,'')),'C') || ' ' || setweight( to_tsvector(coalesce(comment,'')),'D') where id >10 and id < 11000;