gensim icon indicating copy to clipboard operation
gensim copied to clipboard

Implement Okapi BM25 variants in Gensim

Open Witiko opened this issue 3 years ago • 20 comments

This pull request implements the gensim.models.bm25model module, which contains an implementation of the Okapi BM25 model and its modifications (Lucene BM25 and ATIRE) as discussed in https://github.com/RaRe-Technologies/gensim/issues/2592#issuecomment-866799145. The module acts as a replacement for the gensim.summarization.bm25model module deprecated and removed in Gensim 4. The module should supersede the gensim.models.tfidfmodel module as the baseline weighting function for information retrieval and related NLP tasks.

Most implementations of BM25 such as the rank-bm25 library combine dictionary building, weighting, and indexing. To give an example, here is how a user would search for documents with rank-bm25:

from rank_bm25 import BM25Okapi

corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
bm25_model = BM25Okapi(corpus)

query = [["Hello", "bar"]]
similarities = bm25_model.get_scores(query)

As you can see, the interface is convenient, but any advanced operations such as pruning the dictionary, applying semantic matching (e.g. SCM) and query expansion (e.g. RM3), or sharding the index are unavailable.

By contrast, the gensim.models.bm25 module separates the three operations. To give an example, here is how a user would search for documents with the gensim.models.bm25 module:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel, OkapiBM25Model
from gensim.similarities import SparseMatrixSimilarity

corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
dictionary = Dictionary(corpus)
bm25_model = OkapiBM25Model(dictionary=dictionary)
bm25_corpus = bm25_model[list(map(dictionary.doc2bow, corpus))]
bm25_index = SparseMatrixSimilarity(bm25_corpus, num_docs=len(corpus), num_terms=len(dictionary),
                                    normalize_queries=False, normalize_documents=False)

query = [["Hello", "bar"]]
tfidf_model = TfidfModel(dictionary=dictionary, smartirs='bnn')  # Enforce binary weighting of queries
tfidf_query = tfidf_model[dictionary.doc2bow(query)]

similarities = bm25_index[tfidf_query]

Tasks:

  • [x] Add Okapi BM25, ~~BM25L and BM25+~~ [1, 2], Lucene BM25 [3, 4], and ATIRE BM25 [3, 5].
  • [x] Add comments and docstrings to models.bm25.
  • [x] Add comments and docstrings to similarities.docsim.
  • [x] Add BM25 to the run_topics_and_transformations autoexample.
  • [x] Add normalize_queries=True, normalize_documents=True named parameters to SparseMatrixSimilarity, DenseMatrixSimilarity, and SoftCosineSimilarity classes as discussed in https://github.com/RaRe-Technologies/gensim/pull/3304#issuecomment-1061031969 and on the Gensim mailing list. Deprecate the normalize named parameter of SoftCosineSimilarity. Add normalize_queries=False, normalize_documents=False to TF-IDF and BM25 examples.

Witiko avatar Mar 04 '22 19:03 Witiko

Pretty nice! I'll look into this after the 4.2 release.

piskvorky avatar Mar 04 '22 19:03 piskvorky

Codecov Report

Merging #3304 (f43806d) into develop (ac3bbcd) will decrease coverage by 1.77%. The diff coverage is 95.74%.

:exclamation: Current head f43806d differs from pull request most recent head b4843cc. Consider uploading reports for the commit b4843cc to get more accurate results

@@             Coverage Diff             @@
##           develop    #3304      +/-   ##
===========================================
- Coverage    81.43%   79.66%   -1.78%     
===========================================
  Files          122       69      -53     
  Lines        21052    11875    -9177     
===========================================
- Hits         17144     9460    -7684     
+ Misses        3908     2415    -1493     
Impacted Files Coverage Δ
gensim/models/bm25model.py 95.74% <95.74%> (ø)
gensim/scripts/glove2word2vec.py 76.19% <0.00%> (-7.15%) :arrow_down:
gensim/corpora/wikicorpus.py 93.75% <0.00%> (-1.04%) :arrow_down:
gensim/matutils.py 77.23% <0.00%> (-0.90%) :arrow_down:
gensim/similarities/docsim.py 23.95% <0.00%> (-0.76%) :arrow_down:
gensim/models/rpmodel.py 89.47% <0.00%> (-0.53%) :arrow_down:
gensim/models/ldamulticore.py 90.58% <0.00%> (-0.33%) :arrow_down:
gensim/utils.py 71.86% <0.00%> (-0.12%) :arrow_down:
gensim/corpora/dictionary.py 94.17% <0.00%> (-0.09%) :arrow_down:
gensim/models/hdpmodel.py 71.27% <0.00%> (-0.08%) :arrow_down:
... and 91 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov[bot] avatar Mar 04 '22 21:03 codecov[bot]

@piskvorky Thank you. No need to look just yet, I have tried some benchmarks and the code seems to have issues both with speed and with correctness. I will let you know when the PR is ready for review; it is just a draft for the moment.

Witiko avatar Mar 05 '22 13:03 Witiko

I have experimentally confirmed compatibility of 9ab6f52 (Okapi BM25) with the rank-bm25 library:

I have also outlined some issues with the default behavior of the DenseMatrixSimilarity and SparseMatrixSimilarity indexes, which are likely to bite even experienced users and decrease the accuracy of their results with BM25, on the Gensim mailing list.

Witiko avatar Mar 07 '22 18:03 Witiko

The *MatrixSimilarity stuff is the oldest part of Gensim, along with the LsiModel. It dates back to DML-CZ days, in ancient pre-history :) (definitely pre-github)

To me it makes perfect sense to control the index & query normalization via a parameter. Are you able to add such option @Witiko? We have to keep the defaults 100% backward compatible though.

piskvorky avatar Mar 07 '22 18:03 piskvorky

Not a problem, I added it to my task list. The SoftCosineSimilarity constructor uses a single parameter normalized that takes a two-tuple of booleans, one for queries and one for documents; let's deprecate that and have normalize_queries and normalize_documents across the SimilarityABC subclasses, seems more readable.

Witiko avatar Mar 07 '22 19:03 Witiko

OK.

piskvorky avatar Mar 07 '22 21:03 piskvorky

In https://github.com/RaRe-Technologies/gensim/commit/f43806d433955407c39f6f58a9045f847871b9d9, I halfway-implemented BM25L, but I realized that it is difficult to fully implement as a sparse vector space model. That is because in BM25L, document vectors have the weight δ (typically set to δ = 0.5) for all terms that did not occur in the document, which eliminates sparsity. This could be implemented efficiently if scipy.sparse supported a flag that would make the value of zero elements not zero but a different constant, which I doubt it does. Alternatively, we could have a special-purpose index just for BM25L, but that seems to defeat the purpose of implementing things in Gensim, which is interoperability with other vector space concepts and models. Therefore, I plan to abandon BM25L and focus at BM25+ next.

Witiko avatar Mar 16 '22 19:03 Witiko

In be7a0e68, I halfway-implemented BM25+, but similarly to BM25L, BM25+ dictates that every query term contributes at least δ to the score even if it did not appear in a document, which makes it difficult to implement BM25+ as a vector space model.

Witiko avatar Apr 01 '22 11:04 Witiko

Isn't there a way to factor out such "constant" contributions?

Btw the sparse implementation in scipy is pretty "dumb" (in the sense of unoptimized). So writing a specialized for-loop or two, with "sparse non-zero" semantics shouldn't be terribly difficult, even without clever factoring math tricks.

piskvorky avatar Apr 01 '22 12:04 piskvorky

@piskvorky Even if speed is not a concern, the added complexity of duplicating scipy.sparse code seems difficult to justify, since the intent would be just to support scoring functions that have no natural vector space formulation. Personally, I am happy to support "just" Okapi BM25 and a couple more well-behaved variants such as Lucene's BM25 and ATIRE BM25.

Witiko avatar Apr 01 '22 13:04 Witiko

Okay, up to you. I'm not familiar with the motivation BM25+ and BM25L etc, can't judge how relevant / in-demand by users they are.

piskvorky avatar Apr 01 '22 17:04 piskvorky

They are quite relevant. For example, BM25+ outperformed [1] Okapi BM25 on all TREC GOV2, WT100g, WT2G, and Robust04 information retrieval collections. However, the improvement is meager compared the improvement that Okapi BM25 brings over TF-IDF. Therefore, I would rather have Okapi BM25 now (should be ready for review by the end of the week) and add some of the more exotic variants of BM25 later on over inflating the scope of this PR and dragging it out for another month.

Witiko avatar Apr 01 '22 17:04 Witiko

@piskvorky The PR should be ready for review.

Witiko avatar Apr 01 '22 23:04 Witiko

I have performed a sanity check of https://github.com/RaRe-Technologies/gensim/pull/3304/commits/90bc47085e3527b9c8ab4d7d86203721320100a3 using the TREC 6–8 collection:

Vector space model MAP score
Bag of words 6.00%
TF-IDF (nfc.nfc) 13.68%
TF-IDF (dtb.nnn) 22.83%
OkapiBM25Model 24.93%
LuceneBM25Model 26.28%
AtireBM25Model 26.28%

I have already checked that our implementation of OkapiBM25Model agrees with rank_bm25.OkapiBM25 in https://github.com/RaRe-Technologies/gensim/pull/3304#issuecomment-1061000689.

Witiko avatar Apr 06 '22 11:04 Witiko

@piskvorky @gojomo Please, let me know if there is anything I can do to help move this PR along.

Witiko avatar May 12 '22 15:05 Witiko

I'm not close enough to prior related work, or potential future uses, of this to feel qualified to do a close review for correctness/risks/etc. My opinion is thus limited to: sounds useful, and a very quick glance over the diffs makes it look like docs/tests/clean-coding-practices are all in order. So I'd encourage promoting to others better-able to test/use.

gojomo avatar May 18 '22 19:05 gojomo

@Witiko busy ATM, sorry. I plan to get to this in June – is there anything urgent riding on the merge from your side? Some deadlines?

piskvorky avatar May 19 '22 06:05 piskvorky

@piskvorky Thank you. It is not a problem for me; I can point my projects to my fork of gensim:

gensim @ git+https://github.com/witiko/gensim.git@feature/bm25#egg=gensim

The main purpose of this pull request is to bring the BM25 model to the users of gensim.

Witiko avatar May 19 '22 09:05 Witiko

Ping @piskvorky and @gojomo. This pull request implements BM25, the only bag-of-words weighting scheme that is actually used for information retrieval in the wild. This is not a controversial statement: Lucene dropped TF-IDF in favor of BM25 back in 2016, it's time to get Gensim back on track. The code is documented, unit-tested, benchmarked, passes the CI, and my review. Please, let me know if there is anything I can do to champion this PR and help move its review along.

Witiko avatar Jul 18 '22 20:07 Witiko

@piskvorky @gojomo when would Gensim provide support for BM25? This feature is hanging for a long long time now.

Would this ever made it to the library!

ramsey-coding avatar Aug 27 '22 10:08 ramsey-coding

@Witiko what's the equivalent API call for

  • bm25.get_top_n(tokenized_query, corpus, n=80)

ramsey-coding avatar Aug 27 '22 19:08 ramsey-coding

I don't think there is an equivalent API call. You can get all similarities, run argmax over them and take the top 80.

Witiko avatar Aug 27 '22 20:08 Witiko

This would have been very useful for me during the last few weeks. Sadly, there doesn't seem to be much interest in BM25 here.

dunefox avatar Aug 27 '22 20:08 dunefox

I don't think there is an equivalent APU call. Get all similarities, run argmax over them and take the top 80.

@Witiko I don't follow it. You are saying for a given query, I iterate over the whole 350K datapoint (in the corpus) and get similarity and then take top 80.

This would not scale at all. 😭

ramsey-coding avatar Aug 27 '22 20:08 ramsey-coding

@Witiko also the API of gensim is neither user friendly nor convenient. It appears it just provide a similarity score and does not even provide the original document. Devs need to maintain an external data structure to retrieve the original document.

It appears to me Gensim was gold back in the days. But now it is an old, stale, and out-dated library and neither wants to move forward.

Probably time to abandon this library and devs should look for better alternative that provide easier API access and more functionalities.

ramsey-coding avatar Aug 27 '22 20:08 ramsey-coding

What I am saying is that it will be significantly faster than rank-bm25 at retrieval time. (This is a continued discussion from https://github.com/dorianbrown/rank_bm25/issues/27 and https://github.com/dorianbrown/rank_bm25/issues/25.)

Witiko avatar Aug 27 '22 20:08 Witiko

What I am saying is that it will be significantly faster than rank-bm25.

got it

ramsey-coding avatar Aug 27 '22 20:08 ramsey-coding

Gensim will get you similarities in the order of indexing, i.e. if you index documents 1, 2, and 3, and then perform a similarity query, you will get back similarities between the query and documents 1, 2, and 3, respectively.

Witiko avatar Aug 27 '22 20:08 Witiko

@Witiko you are phenomenal, thanks for all the great feedback.

I have one more question:

If I set num_best=80 here:

SparseMatrixSimilarity(bm25_corpus,
                                        num_docs=len(corpus),
                                        num_terms=len(dictionary),
                                        normalize_queries=False,
                                        normalize_documents=False,
                                        num_best=80) // I set num_best=80 and want to get top 80 documents 

And then get similarity like the following, would the result would be sorted by most matched document?

    tfidf_model = TfidfModel(dictionary=bm25_dictionary, smartirs='bnn')  # Enforce binary weighting of queries
    tfidf_query = tfidf_model[bm25_dictionary.doc2bow(tokenized_query)]

    similarities = bm25_index[tfidf_query]
    for doc_no, score in bm25_index[tfidf_query]:
        print("original document:", test_methods_corpus[doc_no])

So question is here the result of bm25_index[tfidf_query] would be sorted based on most matched documents or not?

ramsey-coding avatar Aug 27 '22 20:08 ramsey-coding

@Witiko wow, awesome work. In the context of this implementation:

similarities = bm25_index[tfidf_query]
  • is higher score means the document is more similar to the query?
  • Or lower score means the document is more similar to the query.

Sorry for the stupid question.

smith-co avatar Aug 28 '22 00:08 smith-co

This feature would be very useful for me.

nashid avatar Aug 28 '22 02:08 nashid

@smith-co The similarities are BM25 scores, i.e. the higher the similarity, the more similar the document is to your query.

Witiko avatar Aug 28 '22 09:08 Witiko

@ramsey-coding @smith-co I added outputs to the example code in the original post. Furthermore, I also added an example showing how you can get back the best document for a query. I hope you will find this useful. 😉

Witiko avatar Aug 28 '22 15:08 Witiko

So question is here the result of bm25_index[tfidf_query] would be sorted based on most matched documents or not?

@ramsey-coding Sorry, I wasn't at my computer over the weekend. Yes, your understanding is correct; specifying num_best=80 in SparseMatrixSimilarity(...) will cause bm25_index[tfidf_query] to produce an iterable of 80 document ids and similarities sorted from the most matched document in the descending order of similarity.

Witiko avatar Aug 29 '22 11:08 Witiko

I would really appreciate merging this functionality as I must now use my own custom implementation of BM25 when working with the Gensim library.

mgeletka avatar Aug 30 '22 07:08 mgeletka

Code looks nice and clean, sorry for taking so long to review.

@mpenkov anything else we need before merge?

@Witiko how about post-merge? What can we do to promote this functionality (beyond including it in the Gensim gallery)?

piskvorky avatar Aug 30 '22 12:08 piskvorky

@piskvorky Thank you for taking the time. We can mention in the release notes that Gensim can now be used for Lucene-style information retrieval.

Witiko avatar Aug 30 '22 13:08 Witiko

Sure, it goes into the release notes without saying.

I meant more like some demo, or a practical use-case (who'd use the gensim implementation and why?), or similar. A motivational part, to anchor the technical part.

Maybe @dunefox @mgeletka @smith-co @nashid @ramsey-coding could help?

piskvorky avatar Aug 30 '22 13:08 piskvorky

@mpenkov anything missing here?

Let's aim to release soon after merging, to get this feature out. Thanks.

piskvorky avatar Sep 07 '22 14:09 piskvorky

Sorry for the delay guys, merging.

Thank you for your efforts and your patience @Witiko

mpenkov avatar Sep 08 '22 00:09 mpenkov

Thanks Misha!

@dunefox @mgeletka @smith-co @nashid @ramsey-coding could you write a few sentences about how you use Okapi BM25, or intend to use it?

Your story, your use-case, your motivation to participate in this PR.

piskvorky avatar Sep 08 '22 07:09 piskvorky