w{i} = Cpos * 1/(1+#non-query words)
We could soften a bit this formulae using parameter k, which is the number of words we don't take into account. Cpos is a harmonical mean of weights of words consisting i-th extent.
<a>a b</a><b>c d e f</b><c>a i t</c>
and query 'b & d & e & i' we have extent 'b c d e f a i', which crosses 3 parts (1a4b2c). For C(a)=1; C(b)=0.5; C(c)=0.2
Cpos = 7/(1/1+4/0.5+2/0.2) = 0.37
If no information is available about non-query words (for performance reason, for example), then it's possible to use only query words. For the example above, we approximate extent 1a4b2c by 1a2b1c, so CposApprox = 4/(1/1+2/0.5+1/0.2) = 0.36
Try to take into account extent's frequency:
W{i} = SUM(j=1,freq{i})[ w{j}/j^2 ] (1)
Notice, w{j} is a sorted array and max(W{i})= Pi^2/6; If information about extents frequency in a document is expensive to get, then we could skip this step and consider all extents as unique and use W{i}=w{i}.
W = SUM(1,Next)W{i}
W = SUM(i=1,Next)[Cpos{i}/Nonqw{i}], here Nonqw{i} is the number of non-query words in i-th extent.
Cpos{i}=len{i}/SUM(k=1,len{i})[l{k}/C{k}] or, we could use CposApprox{i} = lenq/SUM(k=1,lenq)[l{k}/C{k}]
For homogeneous both extents and document (all C{k}=Cdoc)
W = Cdoc/SUM(i=1,Next)[len{i} - lenq + 1]
For one-term query we have W1q=Cdoc*Next
It's clear, that we should apply document length normalization, because Next ~ DocLength (see below).
We should apply some normalization technique to take into account that longer document usually have more extents than short documents, i.e., more repeated extents (frequency) and more different extents. Simple technique is to divide document weight by ratio (or logarithm of ration) of document length to average length of a document in the collection. Document length could be counted in terms of words (all words or unique words) or in bytes. Unfortunately, we don't know global statistics, for example, an average length of documents in the collection, so we could use only local (document wide) statistics.
W = W/(1+log(NUW)).
NUW - is the number of unique words in a document and easily obtained from tsvector.
Under all equal conditions we prefer documents which distribution of extents is more concentrated.
Density distribution could be approximated by Dmean - harmonical mean of all distances between extents.
coordinates: 1 2 3 4 5 6 500 distances: 1,1,1,1,1,494 mean=83.1666666666667 geom=2.81160619455686 harm=1.19951436665318 median=1
Cden = 1/(1+log(Dmean)) - correction for density is a global parameter, so it can be applied to the final document weight.
W = W*Cden.
Current interface of rank_cd:
rank_cd('{0.1, 0.2, 0.4, 1.0}',tsvector,tsquery,normalization)
normalization is used to select normalization method(s) and is OR-ed value of following methods:
There is no normalization by default !Example - normalize document rank by logarithm of document length and take into account extents density.
rank_cd('{0.1, 0.2, 0.4, 1.0}',tsvector,tsquery,1|4)
What hammer is better ?
A B EEEEEEEEEEEEEE EEEEE D D D D D D
Explanation: