haystack perf: enhanced `InMemoryDocumentStore` BM25 query efficiency with incremental indexing

Related Issues

This proposal was first made as a stand-alone Haystack document store integration, which is linked to issue number 218 in the haystack-integrations repo.

Proposed Changes:

Instead of reindexing on every new query, I chose to index incrementally whenever documents change. The modifications are primarily to write_documents, delete_documents, and bm25_retrieval.
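To illustrate the idea, here is a minimal sketch of incremental BM25 bookkeeping. It is not the code in this PR: the class name `IncrementalBM25Index`, its `write`/`delete`/`query` methods, the whitespace tokenizer, and the particular IDF variant are all assumptions made for the example. The point is only that corpus statistics are updated when documents change, so a query does no reindexing.

```python
import math
from collections import Counter


class IncrementalBM25Index:
    """Toy Okapi-style BM25 index (hypothetical, for illustration only)."""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.doc_freqs = {}               # doc_id -> Counter of term frequencies
        self.doc_lens = {}                # doc_id -> number of tokens
        self.term_doc_count = Counter()   # term -> number of docs containing it
        self.total_len = 0                # sum of all document lengths

    @staticmethod
    def _tokenize(text):
        # Assumption: lowercase whitespace tokenization.
        return text.lower().split()

    def write(self, doc_id, text):
        # Overwrite semantics: drop old statistics before adding new ones.
        if doc_id in self.doc_freqs:
            self.delete(doc_id)
        tokens = self._tokenize(text)
        freqs = Counter(tokens)
        self.doc_freqs[doc_id] = freqs
        self.doc_lens[doc_id] = len(tokens)
        self.total_len += len(tokens)
        self.term_doc_count.update(freqs.keys())  # +1 per distinct term

    def delete(self, doc_id):
        freqs = self.doc_freqs.pop(doc_id)
        self.total_len -= self.doc_lens.pop(doc_id)
        for term in freqs:
            self.term_doc_count[term] -= 1
            if self.term_doc_count[term] == 0:
                del self.term_doc_count[term]

    def query(self, text, top_k=3):
        n_docs = len(self.doc_freqs)
        if n_docs == 0:
            return []
        avg_len = self.total_len / n_docs
        scores = {}
        for term in set(self._tokenize(text)):
            df = self.term_doc_count.get(term, 0)
            if df == 0:
                continue
            # Assumption: a smoothed, always-positive IDF variant.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for doc_id, freqs in self.doc_freqs.items():
                tf = freqs.get(term, 0)
                if tf == 0:
                    continue
                norm = 1 - self.b + self.b * self.doc_lens[doc_id] / avg_len
                gain = idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
                scores[doc_id] = scores.get(doc_id, 0.0) + gain
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        return ranked[:top_k]
```

With this layout, writing or deleting a document costs time proportional to that document's length, and a query only reads the statistics that are already in place.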

How did you test it?

As suggested by @julian-risch, the change should be non-breaking. Therefore, the test was performed with test cases implemented in test/document_stores/test_in_memory.py. 81 test cases passed and 3 cases failed with explainable causes:

TestMemoryDocumentStore::test_from_dict: self.bm25_algorithm now points to the string literal of the algorithm name, instead of a BM25 object. So, it does not have the .__name__ attribute.
TestMemoryDocumentStore::test_bm25_retrieval_with_non_scaled_BM25Okapi: the pytest fixture initializes a BM25L document store, and the test case then changes the algorithm outside the initializer, so the underlying algorithm remains BM25L instead of Okapi BM25. Initializing the store with the intended algorithm makes the test pass.
TestMemoryDocumentStore::test_bm25_retrieval_with_text_and_table_content: the non-matching documents have tied scores. The test case previously got a "lucky pass" because NumPy quick-sort can reorder documents even when their scores are the same.
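
The tied-score case can be sketched in plain Python. The `rank` helper and the document ids below are hypothetical, not part of the test suite. Python's `sorted` is stable, so tied documents keep their insertion order, whereas an unstable sort gives no such guarantee. An assertion that depends on the order of tied documents is therefore fragile.

```python
def rank(scores, top_k):
    """Return doc ids by descending score; ties keep insertion order."""
    # sorted() is stable, so equal keys preserve their original order.
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    return [doc_id for doc_id, _ in ordered][:top_k]


# Hypothetical scores: one matching document, two non-matching ties.
scores = {"doc_text": 1.2, "doc_table": 0.0, "doc_other": 0.0}
ranked = rank(scores, top_k=3)

# Robust checks: pin the clear winner, compare tied documents as a set.
assert ranked[0] == "doc_text"
assert set(ranked[1:]) == {"doc_table", "doc_other"}
```

Checking the tied documents as a set keeps the test independent of which sorting algorithm the store happens to use.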

Notes for the reviewer

Any suggestion is appreciated :)

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

Apr 12 '24 23:04 Guest400123064

All committers have signed the CLA.

Apr 12 '24 23:04 CLAassistant

@Guest400123064 Thank you for opening this PR! We really appreciate it. Our team will need a little bit more time to review your PR. Having had a first quick look, I think we can remove the haystack_bm25 dependency from the project here and remove the import also from the tests here if it is not used anymore in this single test.

Apr 18 '24 10:04 julian-risch

Thanks for the reply! Yeah, in theory it should completely replicate rank_bm25, but I haven't done an extensive exact comparison, e.g., with fake data generated by hypothesis. I am wondering whether I should directly benchmark the retrieval performance instead of trying to match rank_bm25.

Apr 18 '24 12:04 Guest400123064

Hi @Guest400123064, thanks for your contribution, this is very good work! I left some initial suggestions.

Apr 24 '24 09:04 davidsbatista

Pull Request Test Coverage Report for Build 8938434225

Details

0 of 0 changed or added relevant lines in 0 files are covered.
5 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.2%) to 90.333%

Files with Coverage Reduction	New Missed Lines	%
document_stores/in_memory/document_store.py	5	98.04%
Total:	5

Totals
Change from base Build 8937849375:	0.2%
Covered Lines:	6513
Relevant Lines:	7210

💛 - Coveralls

May 02 '24 09:05 coveralls
