haystack
haystack copied to clipboard
feat: Add component SimilarDocumentsRetriever
Related Issues
Evolved from #5629 and #5666
Proposed Changes:
Addition of a new component SimilarDocumentsRetriever.
This component retrieves similar documents for each of the given documents for each preset retrievers. So, in a way it's simply a wrapper around multiple retrievers to be run on multiple documents as queries.
Usage Example:
from haystack import Document
from haystack.components.retrievers import SimilarDocumentsRetriever
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
docs = [
Document(content="Javascript is a popular programming language"),
Document(content="Python is a popular programming language"),
Document(content="A chromosome is a package of DNA"),
Document(content="DNA carries genetic information"),
]
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store, top_k=1)
sim_docs_retriever = SimilarDocumentsRetriever(retrievers=[retriever])
result = sim_docs_retriever.run(
documents=[
Document(content="A DNA document"),
Document(content="A programming document"),
]
)
print(result["document_lists"])
# [
# [Document(..., content: 'DNA carries genetic information', ...)],
# [Document(..., content: 'Javascript is a popular programming language', ...)]
# ]
Background
It was conceptualized when considering the addition of FileSimilarityRetriever here. There, it's one of the components that come together to provide a file similarity functionality. Route 3 in the demonstrative Colab Notebook there.
But this component by itself should possibly be useful for other use-cases too. E.g. finding similar sets of documents to the current output set from a DocSearch pipeline.
How did you test it?
Mostly unit tests, one integration test.
Notes for the reviewer
Open to feedback at all levels, including if there could be another way to have this functionality.
Some concrete big and small things I'm not sure of:
Should this be reformulated as a different component?
- Should it be more clear from the naming that it's a wrapper for retriever(s) and not itself a retriever?
- Should this be generalized a bit more? E.g. Instead of just
List[Document], also acceptList[str]for added flexibility. And rename component to something likeBatchedRetrieverorGroupRetrieverorMultiRetriever, although each name seems misleading in their own way.
Related to the last point, I'm a bit unsure what's the policy for flexible input and output types right now? E.g. accepting List[Document] as well as List[str] or output format (List[List[Document]] vs List[List[List[Document]]] or something else) based on a preset init argument.
Minor: Unsure what the type hint for the init argument retrievers should be? As (afaik) there is no general Retriever interface. Right now it's just List.
Checklist
- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title:
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:. - I documented my code
- I ran pre-commit hooks and fixed any issue
Pull Request Test Coverage Report for Build 9206339010
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage increased (+0.05%) to 90.612%
| Totals | |
|---|---|
| Change from base Build 9204054527: | 0.05% |
| Covered Lines: | 6631 |
| Relevant Lines: | 7318 |
💛 - Coveralls
A component taking other components in its constructor suggests a design problem somewhere
Yes, it does feel off. Essentially, as you say, it comes about because of the "batching" problem.
When trying to do it directly without this wrapper (Route 2 mentioned here) (after using an OutputAdapter for prob.1 document.content -> query) that issue (prob.2) became the roadblock.
@masci On the path forward, as you say, looks like better to solve the "batching" problem first.
we attempted to solve this problem in a generic way but we never had the time to finish, maybe we should invest there?
Where is it under discussion? just to know where we can follow and/or contribute. Can't seem to find the right keywords to search.
This probably evolved into a custom component. I'm closing the PR now.
Feel free to reopen in the future if having this feature in Haystack is significant.