haystack feat: Add component SimilarDocumentsRetriever

Related Issues

Evolved from #5629 and #5666

Proposed Changes:

Addition of a new component SimilarDocumentsRetriever.

This component retrieves similar documents for each of the given documents for each preset retrievers. So, in a way it's simply a wrapper around multiple retrievers to be run on multiple documents as queries.

Usage Example:

from haystack import Document
from haystack.components.retrievers import SimilarDocumentsRetriever
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Javascript is a popular programming language"),
    Document(content="Python is a popular programming language"),
    Document(content="A chromosome is a package of DNA"),
    Document(content="DNA carries genetic information"),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store, top_k=1)
sim_docs_retriever = SimilarDocumentsRetriever(retrievers=[retriever])

result = sim_docs_retriever.run(
    documents=[
        Document(content="A DNA document"),
        Document(content="A programming document"),
    ]
)

print(result["document_lists"])
# [
#   [Document(..., content: 'DNA carries genetic information', ...)],
#   [Document(..., content: 'Javascript is a popular programming language', ...)]
# ]

Background

It was conceptualized when considering the addition of FileSimilarityRetriever here. There, it's one of the components that come together to provide a file similarity functionality. Route 3 in the demonstrative Colab Notebook there.

But this component by itself should possibly be useful for other use-cases too. E.g. finding similar sets of documents to the current output set from a DocSearch pipeline.

How did you test it?

Mostly unit tests, one integration test.

Notes for the reviewer

Open to feedback at all levels, including if there could be another way to have this functionality.

Some concrete big and small things I'm not sure of:

Should this be reformulated as a different component?

Should it be more clear from the naming that it's a wrapper for retriever(s) and not itself a retriever?
Should this be generalized a bit more? E.g. Instead of just List[Document], also accept List[str] for added flexibility. And rename component to something like BatchedRetriever or GroupRetriever or MultiRetriever, although each name seems misleading in their own way.

Related to the last point, I'm a bit unsure what's the policy for flexible input and output types right now? E.g. accepting List[Document] as well as List[str] or output format (List[List[Document]] vs List[List[List[Document]]] or something else) based on a preset init argument.

Minor: Unsure what the type hint for the init argument retrievers should be? As (afaik) there is no general Retriever interface. Right now it's just List.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

May 23 '24 10:05 bglearning

Pull Request Test Coverage Report for Build 9206339010

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.05%) to 90.612%

Totals
Change from base Build 9204054527:	0.05%
Covered Lines:	6631
Relevant Lines:	7318

💛 - Coveralls

May 23 '24 10:05 coveralls

A component taking other components in its constructor suggests a design problem somewhere

Yes, it does feel off. Essentially, as you say, it comes about because of the "batching" problem.

When trying to do it directly without this wrapper (Route 2 mentioned here) (after using an OutputAdapter for prob.1 document.content -> query) that issue (prob.2) became the roadblock.

May 23 '24 15:05 bglearning

@masci On the path forward, as you say, looks like better to solve the "batching" problem first.

we attempted to solve this problem in a generic way but we never had the time to finish, maybe we should invest there?

Where is it under discussion? just to know where we can follow and/or contribute. Can't seem to find the right keywords to search.

May 24 '24 09:05 bglearning

This probably evolved into a custom component. I'm closing the PR now.

Feel free to reopen in the future if having this feature in Haystack is significant.

Oct 18 '24 14:10 anakin87

haystack haystack copied to clipboard

feat: Add component SimilarDocumentsRetriever

Related Issues

Proposed Changes:

Background

How did you test it?

Notes for the reviewer

Checklist

Pull Request Test Coverage Report for Build 9206339010

Details

💛 - Coveralls

haystack
haystack copied to clipboard