FTS Notes

FTS Notes

These are the notes on PostgreSQL Full-Text Search engine.

Notice: Read my Tsearch2 83 changes summary and Official Text Search documentation

What's new after tsearch2 ?

  • FTS configuration is now schema-aware - each schema may have default configuration. Visibility of the FTS objects conforms to the standard PostgreSQL rule
  • FTS objects may be owned
  • FTS operator ( @@ ) defined for textual data types
  • SQL interface to FTS configuration
  • dump/restore correctly work with FTS
  • ispell dictionary now supports ispell, myspell, hunspell
  • relative paths to dictionary files are now resolved respective to $PGROOT/share directory - convenient in multihosting environment
  • changes in FTS configuration are immediate ( no need for session renewal)
  • better ts_debug() - now shows dictionary name, which recognize a lexeme

fillfactor

GiST index has support of fillfactor parameter in create index command, so one could vary size of GiST index. Fillfactor affects on performance of insert - it's always useful to have some additional place on page. For read-only data one could create index with full (100%) utilization of space, so index will be smaller. Default value for fillfactor is 90.

apod=# create index fts_idx on apod using gist(fts) with (fillfactor=100);
CREATE INDEX
Time: 314.583 ms
apod=# select pg_relation_size('fts_idx');
 pg_relation_size
------------------
           540672
(1 row)
apod=# create index fts_idx on apod using gist(fts) with (fillfactor=50);
CREATE INDEX
Time: 315.192 ms
apod=# select pg_relation_size('fts_idx');
 pg_relation_size
------------------
          1138688
(1 row)

GIN index doesn't yet supports fillfactor.

apod=# create index fts_idx on apod using gin(fts);
CREATE INDEX
Time: 1515.360 ms
apod=# select pg_relation_size('fts_idx');
 pg_relation_size
------------------
          3547136
(1 row)

About losiness of GiST index

It's well-known, that GiST based index for FTS is lossy. Actually, it's not true, since it has optimization for storing small tsvectors (<TOAST_INDEX_TARGET bytes, 512 bytes). On leaf pages small tsvectors stored as is, while longer one are represented by their signatures, which introduce some losiness. Unfortunately, existing index API doesn't allow to say index that it found an exact values (tsvector) or results need to be checked. That's why GiST index is marked as lossy. We hope in future to overcome this issue. We could see this using Gevel module. For experiments we obtain abstracts from arxiv.org, which, we expect, are enough short to be represented as tsvectors on leaf pages.

arxiv=# select gist_stat('gist_idx_90');
                 gist_stat                  
--------------------------------------------
 Number of levels:          4
 Number of pages:           18296
 Number of leaf pages:      17496
 Number of tuples:          435661
 Number of invalid tuples:  0
 Number of leaf tuples:     417366
 Total size of tuples:      124776048 bytes
 Total size of leaf tuples: 119803816 bytes
 Total size of index:       149880832 bytes

-- leaves
arxiv=# select * from gist_print('gist_idx_90') as 
        t(level int,valid bool, fts gtsvector) where level =4;
 level | valid |              fts               
-------+-------+--------------------------------
     4 | t     | 130 true bits, 1886 false bits
     4 | t     | 95 unique words
     4 | t     | 33 unique words
     4 | t     | 77 unique words
     4 | t     | 68 unique words
     4 | t     | 86 unique words
     4 | t     | 77 unique words
     4 | t     | 51 unique words
     4 | t     | 122 unique words
     4 | t     | 127 true bits, 1889 false bits
     4 | t     | 105 unique words
     4 | t     | 170 true bits, 1846 false bits
     4 | t     | 77 unique words
     4 | t     | 121 true bits, 1895 false bits
....................................
     4 | t     | 61 unique words
(417366 rows)

-- internal node
arxiv=# select * from gist_print('gist_idx_90') as 
        t(level int, valid bool, fts gtsvector) where level =3;

 level | valid |              fts               
-------+-------+--------------------------------
     3 | t     | 852 true bits, 1164 false bits
     3 | t     | 861 true bits, 1155 false bits
     3 | t     | 858 true bits, 1158 false bits
     3 | t     | 872 true bits, 1144 false bits
     3 | t     | 858 true bits, 1158 false bits
     3 | t     | 855 true bits, 1161 false bits
     3 | t     | 853 true bits, 1163 false bits
     3 | t     | 857 true bits, 1159 false bits
     3 | t     | 864 true bits, 1152 false bits
     3 | t     | 856 true bits, 1160 false bits
     3 | t     | 848 true bits, 1168 false bits
     3 | t     | 863 true bits, 1153 false bits
     3 | t     | 858 true bits, 1158 false bits
     3 | t     | 859 true bits, 1157 false bits
     3 | t     | 855 true bits, 1161 false bits
     3 | t     | 881 true bits, 1135 false bits
...............................................
     3 | t     | 782 true bits, 1234 false bits
     3 | t     | 773 true bits, 1243 false bits
(17496 rows)

GiN or GiST ?

Direct comparison of performance on abstract from e-print archives. Total number of abstracts - 405690. All tests were done on my desktop pc, fully loaded, P4 2.4Ghz, 2Gb RAM, Linux 2.6.19.1, Slackware,PostgreSQL 8.2.4.

postgresql.conf:

  • shared_buffers = 256MB
  • work_mem = 8MB
  • maintenance_work_mem = 64MB
  • checkpoint_segments = 9
  • effective_cache_size = 256MB

Conclusions:

  • creation time - GiN takes 3x time to build than GiST
  • size of index - GiN is 2-3 times bigger than GiST
  • search time - GiN is 3 times faster than GiST
  • update time - GiN is about 10 times slower than GiST

There is niche for both indexes. Most optimal variant is to use GiST index for online data and GiN for archive.

arxiv=# \d papers
          Table "public.papers"
      Column       |   Type   | Modifiers 
-------------------+----------+-----------
 id                | integer  | 
 oai_id            | text     | 
 datestamp         | date     | 
 title             | text     | 
 authors           | text     | 
 description       | text     | 
 comment           | text     | 
 creation_date     | date     | 
 modification_date | date     | 
 fts               | tsvector | 
Indexes:
    "mdidx" btree (modification_date)
    "pidx" btree (id)

arxiv=# select pg_relation_size('papers');
 pg_relation_size 
------------------
       1054081024

arxiv=# select count(*) from wordstat;
 count  
--------
 459841
arxiv=# update papers set fts=
setweight( to_tsvector(coalesce(title,'')),'A') || 
setweight( to_tsvector(coalesce(description,'')),'B') ||
setweight( to_tsvector(coalesce(authors,'')),'C') ||
setweight( to_tsvector(coalesce(comment,'')),'D');

Summary:

index           creation(ms)    size (b)        count(*)        rank query
-------------------------------------------------------------------------
GiN             532310.368      305864704       38.739          130.488
GIST90          176267.543      145989632       111.891     188.992
GIST100         189321.561      130465792       120.730         215.153
GIST50          164669.614      279306240       122.101         200.963

Updating:

index (nlev)      95            1035            10546
-----------------------------------------------------------
GIN             3343.881        36337.733       217577.424
GIST90  (4)     280.072         1835.485        29597.235
GIST50  (5)     238.101         2952.362        33984.443
GIST100 (4)     232.674         2460.621        27852.507

Queries:

-- count(*)
arxiv=# select count(*) from papers p,  to_tsquery('gamma&ray&burst&!supernovae') q 
where p.fts @@ q;
 count 
-------
  2764
-- ranking query
arxiv=# explain analyze select p.title, rank_cd (p.fts, q) as rank from papers p,  t
o_tsquery('gamma&ray&burst&!supernovae') q where p.fts @@ q  order by rank desc limi
t 10;
-- update query
arxiv=# update papers set fts=
setweight( to_tsvector(coalesce(title,'')),'A') || ' ' ||
setweight( to_tsvector(coalesce(description,'')),'B') || ' ' ||
setweight( to_tsvector(coalesce(authors,'')),'C') ||  ' ' ||
setweight( to_tsvector(coalesce(comment,'')),'D') where id >10 and id < 11000;