Information Retrieval--BM25

BM25: The next generation of TF*IDF

TF: Term Frequency, e.g., how often does "apple" occur in the article?

IDF: Inverse Document Frequency. The document frequency (df) counts how many docs a term appears in. IDF (1/df) measures how rare, and therefore how distinctive, the term is.
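To make the two definitions concrete, here is a minimal sketch on a made-up three-document corpus. The function names and the corpus are illustrative assumptions, and the IDF uses the common log(N/df) scaling rather than the bare 1/df mentioned above.

```python
import math
from collections import Counter

# Toy corpus, invented purely for illustration.
docs = [
    "apple pie with apple sauce",
    "banana bread",
    "apple and banana smoothie",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    """Raw term frequency: how many times `term` occurs in one document."""
    return Counter(tokens)[term]

def df(term, corpus):
    """Document frequency: number of documents that contain `term`."""
    return sum(1 for tokens in corpus if term in tokens)

def idf(term, corpus):
    """Log-scaled inverse document frequency, log(N / df). Assumes df > 0."""
    return math.log(len(corpus) / df(term, corpus))

def tf_idf(term, tokens, corpus):
    """Classic TF*IDF weight of `term` in one document."""
    return tf(term, tokens) * idf(term, corpus)

# "pie" occurs in one document and "apple" in two, so "pie" gets the higher IDF.
print(idf("pie", tokenized), idf("apple", tokenized))
print(tf_idf("apple", tokenized[0], tokenized))
```

A term that occurs in every document gets an IDF of log(1) = 0 under this scaling, so it contributes nothing to the score.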

BM25 stands for "Best Match 25". It improves upon TF*IDF and has its roots in probabilistic information retrieval.

Probabilistic Information Retrieval:
It casts relevance as a probability problem: a relevance score ought to reflect the probability that a user will consider the result relevant.

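BM25's classic IDF, log((N - df + 0.5) / (df + 0.5)) with N the number of docs, gives negative scores for terms with high document frequency: it turns negative once a term appears in more than half of the docs.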
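The sign change is easy to see numerically. The sketch below assumes the classic BM25 IDF formula, log((N - df + 0.5) / (df + 0.5)); the function name and the corpus size of 100 are made up for illustration.

```python
import math

def bm25_idf(df, n_docs):
    """Classic BM25 IDF (assumed form): log((N - df + 0.5) / (df + 0.5))."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

n_docs = 100  # hypothetical corpus size
for df in (1, 10, 50, 90):
    # Positive for rare terms, zero at df = N/2, negative beyond that.
    print(df, round(bm25_idf(df, n_docs), 3))
```

Some implementations avoid negative values by adding 1 inside the logarithm, i.e. log(1 + (N - df + 0.5) / (df + 0.5)), which stays positive for any df.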

Bayes Decision Rule:
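A document D is relevant if P(R|D) > P(NR|D), where R means relevant and NR means non-relevant.
P(R|D) = P(D|R)P(R)/P(D)
P(NR|D) = P(D|NR)P(NR)/P(D)
Substituting both into the inequality and cancelling P(D) gives:
P(D|R)/P(D|NR) > P(NR)/P(R)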
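As a quick check of the rule, the sketch below compares the likelihood ratio P(D|R)/P(D|NR) against the prior ratio P(NR)/P(R). The function name and all probabilities are invented for illustration, and it assumes P(NR) = 1 - P(R).

```python
def is_relevant(p_d_given_r, p_d_given_nr, p_r):
    """Bayes decision rule: relevant iff P(D|R)/P(D|NR) > P(NR)/P(R)."""
    p_nr = 1.0 - p_r  # assumes R and NR are the only two outcomes
    return p_d_given_r / p_d_given_nr > p_nr / p_r

# Made-up numbers: 10% of documents are relevant, so the threshold is about 9.
print(is_relevant(0.02, 0.001, 0.1))   # likelihood ratio about 20, above the threshold
print(is_relevant(0.005, 0.001, 0.1))  # likelihood ratio about 5, below the threshold
```

The rarer relevant documents are, the larger the likelihood ratio a document needs before it is judged relevant.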

Estimating P(D|R):
Assume the terms of D occur independently of each other: P(D|R)=P(d0|R)P(d1|R)...P(dt|R), where d0...dt are the term features of D.

Binary independence model:
Each document is represented by a vector of binary features indicating term occurrence.
pi is the probability that term i occurs in a relevant document; si is the probability that it occurs in a non-relevant document.
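Under these assumptions the likelihood ratio P(D|R)/P(D|NR) factorizes over terms, and ranking by it is equivalent to summing a per-term weight log(pi(1 - si) / (si(1 - pi))) over the terms present in the document. The sketch below illustrates this with made-up values for pi and si; the function names are hypothetical.

```python
import math

def bim_log_likelihood_ratio(x, p, s):
    """log P(D|R)/P(D|NR) for a binary term vector x under term independence."""
    total = 0.0
    for xi, pi, si in zip(x, p, s):
        if xi:
            total += math.log(pi / si)              # term present in the document
        else:
            total += math.log((1 - pi) / (1 - si))  # term absent from the document
    return total

def bim_term_weight_score(x, p, s):
    """Sum of log(pi(1-si) / (si(1-pi))) over the terms present in the document."""
    return sum(
        math.log(pi * (1 - si) / (si * (1 - pi)))
        for xi, pi, si in zip(x, p, s)
        if xi
    )

# Made-up probabilities for a three-term vocabulary.
p = [0.8, 0.5, 0.3]  # pi: P(term i occurs | relevant)
s = [0.2, 0.5, 0.1]  # si: P(term i occurs | non-relevant)
doc_a = [1, 0, 1]
doc_b = [0, 1, 0]

# The two scores differ only by a constant that does not depend on the document,
# so they rank documents identically.
gap_a = bim_log_likelihood_ratio(doc_a, p, s) - bim_term_weight_score(doc_a, p, s)
gap_b = bim_log_likelihood_ratio(doc_b, p, s) - bim_term_weight_score(doc_b, p, s)
print(gap_a, gap_b)
```

A term with pi = si (the second term here) has weight log(1) = 0, so it does not help separate relevant from non-relevant documents.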


Reference:

https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
