BM25

August 07, 2024

My thoughts

from simonw on HN:

BM25 is similar to TF/IDF. In both cases, the key idea is to consider statistics of the overall corpus as part of relevance calculations. If the user searches for “charities in new orleans” in a corpus where “new orleans” is only represented in a few documents, those should clearly rank highly. If the corpus has “new orleans” in almost every document then the term “charity” is more important.

PostgreSQL FTS cannot do this, because it doesn’t maintain statistics for word frequencies across the entire corpus. This severely limits what it can implement in terms of relevance scoring - each result is scored based purely on if the search terms are present or not.

from https://news.ycombinator.com/item?id=41173986
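
To make the point about corpus statistics concrete, here is a rough sketch of BM25 scoring in plain Python. It is not taken from the linked article or from any particular search engine. The function names, the toy corpus, and the parameter values `k1=1.5` and `b=0.75` are my own assumptions for illustration, and the IDF formula is one common non-negative variant among several in use.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count how many documents contain each term (corpus-wide statistic)."""
    return Counter(term for doc in docs for term in set(doc))


def idf(n_docs, doc_freq):
    """Inverse document frequency: rare terms get a larger weight.

    Assumed variant: log(1 + (N - n + 0.5) / (n + 0.5)), which is never negative.
    """
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    df = document_frequencies(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # absent terms contribute nothing
            # Term frequency saturates via k1 and is normalized by document length via b.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf(n_docs, df[term]) * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical): "orleans" is rare, "charities" is everywhere.
docs = [
    "charities in new orleans".split(),
    "charities in boston".split(),
    "charities in chicago".split(),
    "charities in denver".split(),
]
query = "charities in new orleans".split()

for doc, score in zip(docs, bm25_scores(query, docs)):
    print(" ".join(doc), round(score, 3))
```

In this corpus "orleans" appears in one document out of four, so its IDF is much larger than that of "charities", and the New Orleans document comes out on top. Flip the corpus so that "new orleans" is in nearly every document and the weights flip too, which is the behaviour the comment describes and which requires the document-frequency table computed over the whole corpus.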

Read the article: BM25