perf: enhanced `InMemoryDocumentStore` BM25 query efficiency with incremental indexing

Open Guest400123064 opened this issue 2 months ago • 4 comments

Related Issues

This change was first proposed as a stand-alone Haystack document store integration, which is linked to issue number 218 in the haystack-integrations repo.

Proposed Changes:

Instead of rebuilding the BM25 index on every query, I perform incremental indexing whenever the stored documents change. The modifications are primarily to write_documents, delete_documents, and bm25_retrieval.
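To illustrate the idea, here is a minimal sketch of incremental BM25 bookkeeping. It is not the code in this PR: the class name `IncrementalBM25Index`, its method names, the whitespace tokenizer, and the `k1`/`b` defaults are all assumptions made for the example. The point is only that term statistics are updated on write and delete, so a query reads them without reindexing.

```python
import math
from collections import Counter


class IncrementalBM25Index:
    """Hypothetical sketch: BM25 statistics are maintained on add/remove."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.term_freqs = {}        # doc_id -> Counter of term counts
        self.doc_lens = {}          # doc_id -> number of tokens
        self.doc_freqs = Counter()  # term -> number of documents containing it
        self.total_tokens = 0       # sum of all document lengths

    @staticmethod
    def _tokenize(text):
        # Assumption: naive lowercase whitespace tokenization.
        return text.lower().split()

    def add(self, doc_id, text):
        if doc_id in self.term_freqs:
            self.remove(doc_id)  # overwrite an existing document
        counts = Counter(self._tokenize(text))
        length = sum(counts.values())
        self.term_freqs[doc_id] = counts
        self.doc_lens[doc_id] = length
        self.doc_freqs.update(counts.keys())
        self.total_tokens += length

    def remove(self, doc_id):
        counts = self.term_freqs.pop(doc_id)
        self.total_tokens -= self.doc_lens.pop(doc_id)
        for term in counts:
            self.doc_freqs[term] -= 1
            if self.doc_freqs[term] == 0:
                del self.doc_freqs[term]

    def score(self, query):
        """Return an Okapi-style BM25 score for every stored document."""
        n_docs = len(self.term_freqs)
        if n_docs == 0:
            return {}
        avg_len = self.total_tokens / n_docs
        query_terms = self._tokenize(query)
        scores = {}
        for doc_id, counts in self.term_freqs.items():
            total = 0.0
            for term in query_terms:
                tf = counts.get(term, 0)
                if tf == 0:
                    continue
                df = self.doc_freqs[term]
                # Non-negative IDF variant; other BM25 flavours differ here.
                idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
                norm = tf + self.k1 * (
                    1 - self.b + self.b * self.doc_lens[doc_id] / avg_len
                )
                total += idf * tf * (self.k1 + 1) / norm
            scores[doc_id] = total
        return scores
```

In this sketch, writing or deleting a document costs time proportional to that document's length, and a query only reads the precomputed counters.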

How did you test it?

As suggested by @julian-risch, the change should be non-breaking. Therefore, the test was performed with test cases implemented in test/document_stores/test_in_memory.py. 81 test cases passed and 3 cases failed with explainable causes:

  • TestMemoryDocumentStore::test_from_dict: self.bm25_algorithm now holds the algorithm name as a string literal instead of a BM25 object, so it does not have the .__name__ attribute.
  • TestMemoryDocumentStore::test_bm25_retrieval_with_non_scaled_BM25Okapi: the pytest fixture initializes a BM25L document store, and the test case then changes the algorithm outside the initializer, so the underlying algorithm remains BM25L instead of Okapi BM25. Initializing the store with the intended algorithm makes the test pass.
  • TestMemoryDocumentStore::test_bm25_retrieval_with_text_and_table_content: the non-matching documents have tied scores. The test case previously got a "lucky pass" because NumPy's quicksort can reorder documents even when their scores are the same.
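The tied-score failure can be made deterministic with an explicit tie-break. The sketch below uses only the standard library; the helper name `top_k` and the rule of breaking ties by document id are assumptions for illustration, not what the PR or the test suite does. With NumPy, `np.argsort(..., kind="stable")` is another option, since the default quicksort is not a stable sort.

```python
def top_k(scores, k):
    """Rank by descending score; ties are broken by document id (hypothetical rule)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


scores = {"doc_c": 0.0, "doc_a": 1.2, "doc_b": 0.0}
print(top_k(scores, 3))
# → [('doc_a', 1.2), ('doc_b', 0.0), ('doc_c', 0.0)]
```

A test that asserts on the order of non-matching documents then no longer depends on how the sorting algorithm happens to arrange equal scores.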

Notes for the reviewer

Any suggestion is appreciated :)

Checklist

Guest400123064, Apr 12 '24 23:04

CLA assistant check
All committers have signed the CLA.

CLAassistant, Apr 12 '24 23:04

@Guest400123064 Thank you for opening this PR! We really appreciate it. Our team will need a little more time to review it. After a first quick look, I think we can remove the rank_bm25 dependency from the project, and also remove its import from the tests if it is no longer used in that single test.

julian-risch, Apr 18 '24 10:04

Thanks for the reply! Yeah, theoretically it should completely replicate rank_bm25, but I haven't done an extensive exact comparison, e.g., with fake data generated by hypothesis. I am wondering whether I should benchmark the retrieval performance directly instead of trying to match rank_bm25.

Guest400123064, Apr 18 '24 12:04

Hi @Guest400123064, thanks for your contribution, this is very good work! I left some initial suggestions.

davidsbatista, Apr 24 '24 09:04

Pull Request Test Coverage Report for Build 8938434225

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 5 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.2%) to 90.333%

Files with Coverage Reduction:

  • document_stores/in_memory/document_store.py: 5 new missed lines, 98.04%

Totals:

  • Change from base Build 8937849375: 0.2%
  • Covered Lines: 6513
  • Relevant Lines: 7210

💛 - Coveralls

coveralls, May 02 '24 09:05