haystack
perf: enhanced `InMemoryDocumentStore` BM25 query efficiency with incremental indexing
Related Issues
This proposal was first made as a stand-alone Haystack document store integration, which is linked to issue 218 in the haystack-integrations repo.
Proposed Changes:
Instead of reindexing on every new query, I chose to perform incremental indexing whenever documents change. This mainly modifies `write_documents`, `delete_documents`, and `bm25_retrieval`.
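To illustrate the idea, here is a minimal sketch of incremental BM25 bookkeeping. It is not the code in this PR: the class name `IncrementalBM25Index`, the whitespace tokenizer, the `k1`/`b` defaults, and the smoothed IDF variant are all assumptions made for the example. The point is only that term statistics are updated when documents are written or deleted, so a query never has to rebuild the index.

```python
import math
from collections import Counter


class IncrementalBM25Index:
    """Hypothetical sketch of incremental BM25 statistics (not the PR's code)."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.term_freqs = {}        # doc id -> Counter of term counts
        self.doc_freqs = Counter()  # term -> number of documents containing it
        self.total_tokens = 0       # running sum of all document lengths

    @staticmethod
    def _tokenize(text):
        # Assumed tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    def write_document(self, doc_id, text):
        if doc_id in self.term_freqs:
            self.delete_document(doc_id)  # overwrite: drop the old statistics first
        freqs = Counter(self._tokenize(text))
        self.term_freqs[doc_id] = freqs
        self.doc_freqs.update(freqs.keys())  # each distinct term counts once per doc
        self.total_tokens += sum(freqs.values())

    def delete_document(self, doc_id):
        freqs = self.term_freqs.pop(doc_id)
        for term in freqs:
            self.doc_freqs[term] -= 1
            if self.doc_freqs[term] == 0:
                del self.doc_freqs[term]
        self.total_tokens -= sum(freqs.values())

    def bm25_retrieval(self, query, top_k=10):
        n_docs = len(self.term_freqs)
        if n_docs == 0:
            return []
        avg_len = self.total_tokens / n_docs
        query_terms = self._tokenize(query)
        scores = []
        for doc_id, freqs in self.term_freqs.items():
            doc_len = sum(freqs.values())
            score = 0.0
            for term in query_terms:
                tf = freqs.get(term, 0)
                if tf == 0:
                    continue
                df = self.doc_freqs[term]
                # Smoothed, always-positive IDF (an assumed variant).
                idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
                norm = tf + self.k1 * (1 - self.b + self.b * doc_len / avg_len)
                score += idf * tf * (self.k1 + 1) / norm
            scores.append((doc_id, score))
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:top_k]


index = IncrementalBM25Index()
index.write_document("a", "the quick brown fox")
index.write_document("b", "the lazy dog")
results = index.bm25_retrieval("fox")
```

In this sketch, writes and deletes cost time proportional to the length of the affected document, while a query reads only the precomputed statistics.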
How did you test it?
As suggested by @julian-risch, the change should be non-breaking. Therefore, I ran the test cases implemented in `test/document_stores/test_in_memory.py`. 81 test cases passed and 3 failed, each with an explainable cause:
- `TestMemoryDocumentStore::test_from_dict`: `self.bm25_algorithm` now points to the string literal of the algorithm name instead of a `BM25` object, so it does not have the `.__name__` attribute.
- `TestMemoryDocumentStore::test_bm25_retrieval_with_non_scaled_BM25Okapi`: the pytest fixture initializes a BM25L document store, and the test case changes the algorithm after construction rather than through the initializer, so the underlying algorithm remains BM25L instead of Okapi BM25. Changing the algorithm passed to the initializer makes the test pass.
- `TestMemoryDocumentStore::test_bm25_retrieval_with_text_and_table_content`: the non-matching documents have tied scores. The test case got a "lucky pass" because NumPy's quicksort reorders documents even when their scores are the same.
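The tied-score case can be made independent of the sort implementation by breaking ties explicitly. The sketch below is only an illustration of that idea: the helper `rank_with_tiebreak`, the tie-breaking rule (ascending document id), and the sample scores are assumptions, not part of this PR or of the existing test.

```python
def rank_with_tiebreak(scores, top_k):
    """Return document ids by descending score.

    Ties are broken by ascending document id (a hypothetical rule),
    so the result does not depend on how the sort treats equal keys.
    """
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]


# Two non-matching documents share a score of 0.0, as in the failing test.
scores = {"text_doc": 0.0, "match": 1.7, "table_doc": 0.0}
ranked = rank_with_tiebreak(scores, top_k=3)
```

Alternatively, a test can assert only on the documents whose scores are distinct and leave the relative order of tied documents unspecified.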
Notes for the reviewer
Any suggestion is appreciated :)
Checklist
- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.
- I documented my code
- I ran pre-commit hooks and fixed any issue
@Guest400123064 Thank you for opening this PR! We really appreciate it. Our team will need a little more time to review your PR. Having had a first quick look, I think we can remove the `rank_bm25` dependency from the project here and also remove the import from the tests here if it is no longer used in this single test.
Thanks for the reply! Yeah, in theory it should completely replicate `rank_bm25`; I haven't done an extensive exact comparison, e.g., with fake data generated by `hypothesis`. But I am wondering whether I should directly benchmark the retrieval performance instead of trying to match `rank_bm25`.
Hi @Guest400123064, thanks for your contribution, this is very good work! I left some initial suggestions.
Pull Request Test Coverage Report for Build 8938434225
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 5 unchanged lines in 1 file lost coverage.
- Overall coverage increased (+0.2%) to 90.333%
Files with Coverage Reduction | New Missed Lines | % |
---|---|---|
document_stores/in_memory/document_store.py | 5 | 98.04% |
Total: | 5 | |

Totals | |
---|---|
Change from base Build 8937849375: | 0.2% |
Covered Lines: | 6513 |
Relevant Lines: | 7210 |