haystack icon indicating copy to clipboard operation
haystack copied to clipboard

Proposal to add file similarity retriever to haystack

Open elundaeva opened this issue 1 year ago • 13 comments

Related Issues

Link to code PR - https://github.com/deepset-ai/haystack/pull/5666

Proposed Changes:

Proposal to add file similarity retriever to haystack

elundaeva avatar Aug 25 '23 13:08 elundaeva

This node uses composition with another node which is generally not something we would like for v2. Instead, we would probably prefer if the node was split in two separate nodes (e.g. a DocumentFetcher and a DocumentAggregator) that could provide the input and accept the output of a retriever in a pipeline. However, I can also see that having one node is probably easier for the user so it might make sense for now in v1. Still, I don't think there are any nodes currently that take another node as a parameter (document stores are not nodes so they don't count). @ZanSara do you perhaps have an opinion on this?

MichelBartels avatar Sep 01 '23 12:09 MichelBartels

This node uses composition with another node which is generally not something we would like for v2. Instead, we would probably prefer if the node was split in two separate nodes (e.g. a DocumentFetcher and a DocumentAggregator) that could provide the input and accept the output of a retriever in a pipeline. However, I can also see that having one node is probably easier for the user so it might make sense for now in v1.

@MichelBartels I've commented on this in many places and it was always said that v2 would also provide something like higher level components to make certain things easier. I think that in the case of this component it would be even more necessary to have that abstraction layer above. If you don't have it, then users need to figure to whole concept of file similarity retrieval out for themselves and piece everything together. Maybe a good time to start thinking about how that usable abstraction could look like in v2?

mathislucka avatar Sep 01 '23 14:09 mathislucka

@mathislucka I agree with you and we have decided at our co.work last week that without "subpipelines" or a similar abstraction the v2 concept of pipelines is not really usable. That's also why I was asking. It seems like it's a tradeoff between ease of porting it to v2 and usability while we're still using v1 pipelines.

MichelBartels avatar Sep 01 '23 14:09 MichelBartels

I confirm what Michel said about v2 "subpipelines": we have discussed the topic during our co:work and agreed that this abstraction is going to be present in v2 in several situations. So when implementing components we shouldn't worry about splitting them into smaller units, because later they can be grouped back into larger units that contain "subpipelines" made of such smaller components.

ZanSara avatar Sep 04 '23 10:09 ZanSara

Thank you for the input! 💯 In that case, @ZanSara @MichelBartels I can try to implement it as two separate sub-components:

  • DocumentFetcher that gets all docs corresponding to a file based on a provided aggregation meta key and returns a list of docs to be used by one or two retrievers as input for batch retrieval. &
  • DocumentAggregator placed after the retriever (or the JoinNode) that aggregates the document list(s) into a single ranked list of files (which are put together based on file_ids of retrieved docs) using RRF.

How does it sound?

elundaeva avatar Sep 04 '23 13:09 elundaeva

@elundaeva That sounds good to me. @ZanSara Just to confirm, would we then want some sort of abstraction for v1? Otherwise, it's not very usable until we release v2.

MichelBartels avatar Sep 04 '23 13:09 MichelBartels

This proposal is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Oct 21 '23 01:10 github-actions[bot]

Hi, Looking to reopen this proposal (or create a new one if necessary) of adding the FileSimilarityRetrieval now that v2 is in beta.

Summarizing the discussion till now

On Design Principle

  • We'd like to avoid composition of the components and ideally have many smaller components than one or few big ones. As such, the suggestion to create two components (DocumentFetcher, DocumentAggregator) instead of one (FileSimilarityRetriever).
  • The joining of such smaller components may be handled with subpipelines (details to be discussed)

On behavior

Compared to the custom FileSimilarityRetriever already available internally.

  • Return value: could be list of agg_key values instead of one doc per agg group
  • Would be good to pass list of retrievers instead of explicitly passing primary, secondary retrievers.

On links with other components

  • The FilterRetriever in v1 is very close to the DocumentFetcher being proposed. Differences:

    • The way the input is specified (direct meta value as query vs ignored query + full filter syntax)
    • The output format (list of docs vs possibly list of strings i.e. doc contents)
  • The DocumentJoiner could combine the docs from the different retrievers. It doesn't provide aggregation which would need to be another component. Also two undesired behavior:

    • It would overwrite rrf score for a document with its last appearance instead of adding them up.
    • It would overwrite the original score

Questions and way forward

I tried creating the pipeline with v2 in this Colab Notebook. There seems to be two main options.

For both we'd need a DocumentFetcher component. We could also name it FilterRetriever. But would need to settle on exact behavior (perhaps converging on/collapsing the differences above). Maybe could be a separate proposal?

Assuming we have this fetcher:

Route 1

A bigger scoped MetaDocumentAggregator that wraps the retrievers (so has component composition) and also performs the aggregation.

file_retriever_01

The notebook has a basic version of this.

Route 2

A smaller scoped MetaDocumentAggregator that only does aggregation and receives documents from the retrievers.

file_retriever_02

I couldn't get this to work. The retrievers expect str whilst we produce List[Document] or List[str] (with a shaper-like node). Also depending on the retriever we might want to pass content or embedding. As such this seems harder to implement?

Asides

  • Seems like this could be a more general (something like MetaDocSimilarityRetriever) rather than FileSimilarityRetriever as the grouping can be things apart from file too.
  • List of retrievers seem fine for Route 1 (at least in code) (?)

Next Step Questions

  • First propose and implement a "FilterRetriever"?
  • Go with Route 1 or figure out a way to make Route 2 work?
  • Are we okay to have variable output type that depends on init arguments? (List[Document] i.e. one from each meta-key grouping vs List[str] i.e. the meta-key values themselves)

bglearning avatar Jan 16 '24 18:01 bglearning

Hey @bglearning thanks for your work on this and for exploring the two different options you proposed.

Let's start with your step 1 so

First propose and implement a "FilterRetriever"?

but I think you can go ahead and implement it based off of the v1 version. No need for a full proposal since it's a known and understood feature of Haystack v1. In the PR we can revisit something like the name, but my feeling is that we should leave it as FilterRetriever and we may need to implement different versions based on the doc store. For example, we have now have an InMemoryEmbeddingRetriever and then a separate OpenSearchEmbeddingRetriever implemented in our haystack-core-integrations repo.

So we would want something like a InMemoryFilterRetriever and a OpenSearchFilterRetriever in the respective packages.

sjrl avatar Jan 17 '24 08:01 sjrl

I do agree with @sjrl here, no need for a new proposal to add a FilterRetriever.

Also I think it can be generic enough to work with every Document Store. The filter_documents() method is part of the common Document Store interface. So there's no need to have a FilterRetriever for each Document Store.

I couldn't get this to work. The retrievers expect str whilst we produce List[Document] or List[str] (with a shaper-like node). Also depending on the retriever we might want to pass content or embedding. As such this seems harder to implement?

@bglearning we're aware of this connection limitation and we're investigating possible solutions. The fastest solution would be changing the input types to accept both but that complicates a bit the logic. We're trying to come up with something better. 🤔

silvanocerza avatar Jan 17 '24 09:01 silvanocerza

Hi all, getting back to this with the FilterRetriever (#6836) now available, plus with latest developments around v2.

I updated the Colab Notebook above based on latest developments.

Now, for Route 2, with OutputAdapter (#6936) available there is a way to connect the results from the first (filter) retrieval to the subsequent retrievers. However, given the retrievers take one query, I think there isn't a way to run them on all resulting docs from the first retrieval(?). I only took the first of the documents to make it work here. Aside: it's cool that we can simply connect multiple outputs to a component with Variadic input.

As such, leaning towards the first approach (Route1) mentioned above i.e. with a MetaDocumentAggregator that has a list of retrievers. Please flag any issue that might cause (i.e. having list[<another_component>] inside a component. I think there won't be an issue with (de)-serialization. But can't think of other problems except of course that it goes somewhat against the multiple small components approach.

Next: I'll update the proposal to reflect this.

Route 1

Route 2

bglearning avatar Feb 16 '24 13:02 bglearning

As such, leaning towards the first approach (Route1) mentioned above i.e. with a MetaDocumentAggregator that has a list of retrievers.

Yeah I think I am also leaning towards this approach. I think using this it would be easier to perform checks on the incoming documents as well from the FilterRetriever in this case. For example, we will want to enforce that all incoming documents to the MetaDocumentAggregator all share the same value in the provided meta key (agg_key in your colab).

sjrl avatar Feb 19 '24 10:02 sjrl

Revisiting this as part of v2 migrations. Considering a third route (also again added to the Colab Notebook):

route3

To summarize:

We want to do two things based on resultant documents from the filter retriever:

  1. Run multiple retrievers for each document to get a set of similar documents.
  2. Group/aggregate the documents from 1 based on an agg_key

Route 1 above does both things inside a single component. Route 2 creates one component for 2 and tries to run 1 without a wrapper component for the retrievers but this doesn't seem to be possible. Route 3 creates one component for 1 (SimilarDocumentsRetriever) and one for 2 (DocumentMetaAggregator).

Leaning towards Route3 now as it results in smaller and possibly independently reusable components. E.g. "SimilarDocuments to a select group of documents from previous result" or "Aggregate/group the result from a query based on metadata using DocumentMetaAggregator"

bglearning avatar May 10 '24 11:05 bglearning

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Jun 10 '24 01:06 github-actions[bot]