Information Retrieval--BM25

BM25: The next generation of TF*IDF

TF: Term Frequency, e.g. how often does "apple" occur in the article?
IDF: Inverse Document Frequency. The document frequency (df) measures how many documents a term appears in; IDF (1/df) measures how special the term is.

BM25 stands for "Best Matching 25". It improves upon TF*IDF and has its roots in probabilistic information retrieval.

Probabilistic Information Retrieval: It casts relevance as a probability problem. A relevance score ought to reflect the probability that a user will consider the result relevant. BM25's IDF can give negative scores for terms with high document frequency.

Bayes Decision Rule: A document D is relevant if P(R|D) > P(NR|D), where R means relevant and NR means non-relevant.
P(R|D) = P(D|R)P(R)/P(D)
P(NR|D) = P(D|NR)P(NR)/P(D)
--> P(D|R)/P(D|NR) > P(NR)/P(R)

Estimating P(D|R): Assume independence
P(D|R) = P(d0|R)P(d1|R)...P(dt|R)

Binary independence model: Document represented by a vector of binary features indica...
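Two of the points above are easier to see with numbers: why BM25's IDF can go negative, and how the Bayes decision rule turns into a likelihood-ratio test once the terms are assumed independent. The sketch below is only illustrative. The post does not state an IDF formula, so the code assumes the commonly cited Robertson/Sparck Jones form log((N - df + 0.5) / (df + 0.5)); the function names `bm25_idf` and `is_relevant` and all probabilities passed to them are hypothetical.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """IDF in the classic Robertson/Sparck Jones form (an assumption here,
    since the post does not spell the formula out).

    The value drops below zero once a term occurs in more than half of
    the documents, i.e. when doc_freq > num_docs / 2.
    """
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def is_relevant(p_terms_given_r, p_terms_given_nr, prior_r: float) -> bool:
    """Bayes decision rule under the term-independence assumption.

    p_terms_given_r[i]  stands for P(d_i | R)
    p_terms_given_nr[i] stands for P(d_i | NR)
    prior_r             stands for P(R), so P(NR) = 1 - prior_r

    Returns True when P(D|R) / P(D|NR) > P(NR) / P(R). The comparison is
    done in log space so that long products do not underflow.
    """
    log_ratio = sum(
        math.log(p_r) - math.log(p_nr)
        for p_r, p_nr in zip(p_terms_given_r, p_terms_given_nr)
    )
    log_threshold = math.log((1.0 - prior_r) / prior_r)
    return log_ratio > log_threshold


if __name__ == "__main__":
    # A rare term gets a positive weight, a very common one a negative weight.
    print(bm25_idf(10, 1000))
    print(bm25_idf(900, 1000))

    # Made-up per-term probabilities: the likelihood ratio is 4 * 2 = 8.
    print(is_relevant([0.8, 0.6], [0.2, 0.3], prior_r=0.2))  # threshold 4
    print(is_relevant([0.8, 0.6], [0.2, 0.3], prior_r=0.1))  # threshold 9
```

With the same evidence, the document passes the test when the prior P(R) is 0.2 but fails when it is 0.1, because a lower prior raises the threshold P(NR)/P(R). Some implementations avoid the negative IDF by adding 1 inside the logarithm; whether that variant applies depends on the library in use.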