N - total number of unique words in the document collection (the dimensionality)
ND - total number of documents
di: {w_1i, w_2i, ..., w_Ni}, dj: {w_1j, w_2j, ..., w_Nj}
w_ki - weight of the k-th word in the i-th document:
    w_ki = 0, if the k-th word is not in the i-th document,
    w_ki = f_ki*log(ND/nd_k) otherwise
nd_k - number of documents that contain the k-th word
f_ki - local frequency of the k-th word in the i-th document, a kind of local importance:
    f_ki = n_ki - simple count
    f_ki = 1 + log(n_ki) if n_ki > 0, f_ki = 0 if n_ki = 0
    f_ki = n_ki/sum(n_ki) - normalized to the document length
    f_ki = n_ki/max(n_ki) - normalized to the most frequent word
sim(di,dj) = DI*DJ/(|DI|*|DJ|), i.e.
sim(di,dj) = sum(k=1,N, w_ki*w_kj) / ( sqrt(sum(k=1,N, w_ki^2)) * sqrt(sum(k=1,N, w_kj^2)) );
in sum(k=1,N, w_ki*w_kj) only the intersected entries matter, since zero weights contribute nothing.
0 <= sim() <= 1
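A minimal runnable sketch of these definitions in Python (assumptions: log base 10 as in the example below, f_ki = n_ki raw counts, documents given as word->count dicts; tfidf_weights and cosine_sim are illustrative names, not from the original):

import math

def tfidf_weights(doc_counts, ND, nd):
    # w_ki = f_ki * log(ND/nd_k) with f_ki = n_ki (raw count);
    # words absent from the document simply get no entry, i.e. w_ki = 0
    return {w: n * math.log10(ND / nd[w]) for w, n in doc_counts.items()}

def cosine_sim(wi, wj):
    # sim(di,dj) = DI*DJ / (|DI|*|DJ|); the dot product runs over the
    # intersected entries only, since zero weights contribute nothing
    dot = sum(wi[w] * wj[w] for w in wi.keys() & wj.keys())
    ni = math.sqrt(sum(v * v for v in wi.values()))
    nj = math.sqrt(sum(v * v for v in wj.values()))
    return dot / (ni * nj) if ni and nj else 0.0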
example:
ND=2
d1 = "to be or not to be"
d2 = "to be and not to be"
words = {and, be, not, or, to}, N=5 dimensions
d1 = {X, be, not, or, to}
d2 = {and, be, not, X, to}
sim(Jaccard) = |d1 ∧ d2|/|d1 ∨ d2| = 5/7 on token multisets (∧ = min of counts, ∨ = max), or 3/5 on unique words
==================
w_ki = n_ki - counts
d1 = {0,2,1,1,2}, |d1| = sqrt(10)
d2 = {1,2,1,0,2}, |d2| = sqrt(10)
D1*D2 = 0*1 + 2*2 + 1*1 + 1*0 + 2*2 = 9
sim = 9/10
==================
w_ki = n_ki*log(ND/nd_k) - tf-idf weighting (log base 10)
d1 = {0,0,0,0.3,0}
d2 = {0.3,0,0,0,0}
sim = 0 !!!??? - because d1 and d2 consist of common ("spam") words only, and a word occurring in every document gets zero idf
variant: w_ki = f_ki*(1 + log(ND/nd_k)), here with binary f_ki, so the logged factor never vanishes
d1 = {0,1,1,1.3,1}
d2 = {1.3,1,1,0,1}
sim = 3/4.69 = 0.64
==================
w_ki = n_ki*(ND/nd_k)
d1 = {0,2,1,2,2}
d2 = {2,2,1,0,2}
sim = 9/13 = 0.69
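Checking the example's arithmetic (a sketch; log base 10 assumed, variable names are mine):

import math
from collections import Counter

docs = ["to be or not to be", "to be and not to be"]
counts = [Counter(d.split()) for d in docs]        # n_ki per document
ND = len(docs)
nd = Counter(w for c in counts for w in c)         # nd_k: docs containing word k
d1, d2 = counts

# w_ki = n_ki: dot = 9, |d1| = |d2| = sqrt(10) -> sim = 9/10
dot = sum(d1[w] * d2[w] for w in d1.keys() & d2.keys())
norm = math.sqrt(sum(v*v for v in d1.values())) * math.sqrt(sum(v*v for v in d2.values()))
print(dot / norm)                                  # 0.9

# Jaccard on unique words: |{be,not,to}| / |{and,be,not,or,to}| = 3/5
print(len(set(d1) & set(d2)) / len(set(d1) | set(d2)))   # 0.6

# w_ki = n_ki*log10(ND/nd_k): every shared word has idf 0, so the dot product is 0
w1 = {w: n * math.log10(ND / nd[w]) for w, n in d1.items()}
w2 = {w: n * math.log10(ND / nd[w]) for w, n in d2.items()}
print(sum(w1[w] * w2[w] for w in w1.keys() & w2.keys()))  # 0.0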
Notice: s1 = similarity({1,2,3}, {1,2,3}) = 1, s2 = similarity({1,2,3,4,...,100}, {1,2,3,4,...,100}) = 1. BUT, I'd say s2 must be bigger than s1, so we need to further normalize our sim() to take into account the absolute size of the overlapped region.
sim' = sim * 1/(1 + log(NMAX/Noverlapped)), with
NMAX - maximum possible size of the overlapped region = maximum document length
Noverlapped - size of the overlapped region
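A sketch of that adjustment (log base 10 assumed, consistent with the example above; overlap_adjusted_sim is an illustrative name):

import math

def overlap_adjusted_sim(sim, n_overlap, n_max):
    # damp sim by 1/(1 + log(NMAX/Noverlapped)); a full-size overlap
    # (Noverlapped = NMAX) leaves sim unchanged
    if n_overlap == 0:
        return 0.0
    return sim / (1.0 + math.log10(n_max / n_overlap))

# the Notice above: both raw sims are 1, but the 100-word overlap now scores higher
print(overlap_adjusted_sim(1.0, 3, 100))    # ~0.40
print(overlap_adjusted_sim(1.0, 100, 100))  # 1.0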