haystack
haystack copied to clipboard
Proposal to add file similarity retriever to haystack
Related Issues
Link to code PR - https://github.com/deepset-ai/haystack/pull/5666
Proposed Changes:
Proposal to add file similarity retriever to haystack
This node uses composition with another node which is generally not something we would like for v2. Instead, we would probably prefer if the node was split in two separate nodes (e.g. a DocumentFetcher
and a DocumentAggregator
) that could provide the input and accept the output of a retriever in a pipeline. However, I can also see that having one node is probably easier for the user so it might make sense for now in v1.
Still, I don't think there are any nodes currently that take another node as a parameter (document stores are not nodes so they don't count). @ZanSara do you perhaps have an opinion on this?
This node uses composition with another node which is generally not something we would like for v2. Instead, we would probably prefer if the node was split in two separate nodes (e.g. a DocumentFetcher and a DocumentAggregator) that could provide the input and accept the output of a retriever in a pipeline. However, I can also see that having one node is probably easier for the user so it might make sense for now in v1.
@MichelBartels I've commented on this in many places and it was always said that v2 would also provide something like higher level components to make certain things easier. I think that in the case of this component it would be even more necessary to have that abstraction layer above. If you don't have it, then users need to figure to whole concept of file similarity retrieval out for themselves and piece everything together. Maybe a good time to start thinking about how that usable abstraction could look like in v2?
@mathislucka I agree with you and we have decided at our co.work last week that without "subpipelines" or a similar abstraction the v2 concept of pipelines is not really usable. That's also why I was asking. It seems like it's a tradeoff between ease of porting it to v2 and usability while we're still using v1 pipelines.
I confirm what Michel said about v2 "subpipelines": we have discussed the topic during our co:work and agreed that this abstraction is going to be present in v2 in several situations. So when implementing components we shouldn't worry about splitting them into smaller units, because later they can be grouped back into larger units that contain "subpipelines" made of such smaller components.
Thank you for the input! 💯 In that case, @ZanSara @MichelBartels I can try to implement it as two separate sub-components:
- DocumentFetcher that gets all docs corresponding to a file based on a provided aggregation meta key and returns a list of docs to be used by one or two retrievers as input for batch retrieval. &
- DocumentAggregator placed after the retriever (or the JoinNode) that aggregates the document list(s) into a single ranked list of files (which are put together based on file_ids of retrieved docs) using RRF.
How does it sound?
@elundaeva That sounds good to me. @ZanSara Just to confirm, would we then want some sort of abstraction for v1? Otherwise, it's not very usable until we release v2.
This proposal is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.
Hi, Looking to reopen this proposal (or create a new one if necessary) of adding the FileSimilarityRetrieval now that v2 is in beta.
Summarizing the discussion till now
On Design Principle
- We'd like to avoid composition of the components and ideally have many smaller components than one or few big ones. As such, the suggestion to create two components (
DocumentFetcher
,DocumentAggregator
) instead of one (FileSimilarityRetriever
). - The joining of such smaller components may be handled with subpipelines (details to be discussed)
On behavior
Compared to the custom FileSimilarityRetriever
already available internally.
- Return value: could be list of agg_key values instead of one doc per agg group
- Would be good to pass list of retrievers instead of explicitly passing primary, secondary retrievers.
On links with other components
-
The
FilterRetriever
in v1 is very close to theDocumentFetcher
being proposed. Differences:- The way the input is specified (direct meta value as query vs ignored query + full filter syntax)
- The output format (list of docs vs possibly list of strings i.e. doc contents)
-
The
DocumentJoiner
could combine the docs from the different retrievers. It doesn't provide aggregation which would need to be another component. Also two undesired behavior:- It would overwrite rrf score for a document with its last appearance instead of adding them up.
- It would overwrite the original score
Questions and way forward
I tried creating the pipeline with v2 in this Colab Notebook. There seems to be two main options.
For both we'd need a DocumentFetcher
component. We could also name it FilterRetriever
. But would need to settle on exact behavior (perhaps converging on/collapsing the differences above). Maybe could be a separate proposal?
Assuming we have this fetcher:
Route 1
A bigger scoped MetaDocumentAggregator
that wraps the retrievers (so has component composition) and also performs the aggregation.
The notebook has a basic version of this.
Route 2
A smaller scoped MetaDocumentAggregator
that only does aggregation and receives documents from the retrievers.
I couldn't get this to work. The retrievers expect str
whilst we produce List[Document]
or List[str]
(with a shaper-like node). Also depending on the retriever we might want to pass content
or embedding
. As such this seems harder to implement?
Asides
- Seems like this could be a more general (something like
MetaDocSimilarityRetriever
) rather thanFileSimilarityRetriever
as the grouping can be things apart from file too. - List of retrievers seem fine for Route 1 (at least in code) (?)
Next Step Questions
- First propose and implement a "FilterRetriever"?
- Go with Route 1 or figure out a way to make Route 2 work?
- Are we okay to have variable output type that depends on init arguments? (List[Document] i.e. one from each meta-key grouping vs List[str] i.e. the meta-key values themselves)
Hey @bglearning thanks for your work on this and for exploring the two different options you proposed.
Let's start with your step 1 so
First propose and implement a "FilterRetriever"?
but I think you can go ahead and implement it based off of the v1 version. No need for a full proposal since it's a known and understood feature of Haystack v1. In the PR we can revisit something like the name, but my feeling is that we should leave it as FilterRetriever
and we may need to implement different versions based on the doc store. For example, we have now have an InMemoryEmbeddingRetriever
and then a separate OpenSearchEmbeddingRetriever
implemented in our haystack-core-integrations repo.
So we would want something like a InMemoryFilterRetriever
and a OpenSearchFilterRetriever
in the respective packages.
I do agree with @sjrl here, no need for a new proposal to add a FilterRetriever
.
Also I think it can be generic enough to work with every Document Store. The filter_documents()
method is part of the common Document Store interface. So there's no need to have a FilterRetriever
for each Document Store.
I couldn't get this to work. The retrievers expect
str
whilst we produceList[Document]
orList[str]
(with a shaper-like node). Also depending on the retriever we might want to passcontent
orembedding
. As such this seems harder to implement?
@bglearning we're aware of this connection limitation and we're investigating possible solutions. The fastest solution would be changing the input types to accept both but that complicates a bit the logic. We're trying to come up with something better. 🤔
Hi all, getting back to this with the FilterRetriever
(#6836) now available, plus with latest developments around v2.
I updated the Colab Notebook above based on latest developments.
Now, for Route 2, with OutputAdapter
(#6936) available there is a way to connect the results from the first (filter) retrieval to the subsequent retrievers. However, given the retrievers take one query, I think there isn't a way to run them on all resulting docs from the first retrieval(?). I only took the first of the documents to make it work here. Aside: it's cool that we can simply connect multiple outputs to a component with Variadic
input.
As such, leaning towards the first approach (Route1) mentioned above i.e. with a MetaDocumentAggregator
that has a list of retrievers. Please flag any issue that might cause (i.e. having list[<another_component>]
inside a component. I think there won't be an issue with (de)-serialization. But can't think of other problems except of course that it goes somewhat against the multiple small components approach.
Next: I'll update the proposal to reflect this.
Route 1
Route 2
As such, leaning towards the first approach (Route1) mentioned above i.e. with a MetaDocumentAggregator that has a list of retrievers.
Yeah I think I am also leaning towards this approach. I think using this it would be easier to perform checks on the incoming documents as well from the FilterRetriever in this case. For example, we will want to enforce that all incoming documents to the MetaDocumentAggregator all share the same value in the provided meta key (agg_key
in your colab).
Revisiting this as part of v2 migrations. Considering a third route (also again added to the Colab Notebook):
To summarize:
We want to do two things based on resultant documents from the filter retriever:
- Run multiple retrievers for each document to get a set of similar documents.
- Group/aggregate the documents from 1 based on an agg_key
Route 1 above does both things inside a single component.
Route 2 creates one component for 2 and tries to run 1 without a wrapper component for the retrievers but this doesn't seem to be possible.
Route 3 creates one component for 1 (SimilarDocumentsRetriever
) and one for 2 (DocumentMetaAggregator
).
Leaning towards Route3 now as it results in smaller and possibly independently reusable components. E.g. "SimilarDocuments to a select group of documents from previous result" or "Aggregate/group the result from a query based on metadata using DocumentMetaAggregator"
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.