Lost in the middle: We have been ordering documents the WRONG way. (for long context)

Motivation: it seems that when dealing with a long context and a "big" number of relevant documents, we must avoid using the out-of-the-box score ordering from vector stores. See: https://arxiv.org/pdf/2306.01150.pdf

So, I added an additional parameter that allows you to reorder the retrieved documents so we can work around this performance degradation. The reordering respects the original search scores but places the least relevant documents in the middle of the context. Extract from the paper (one image speaks 1000 tokens): [figure from the paper]. This seems to be common to all the different architectures, so I think we need a good generic way to implement this reordering and run some tests on our already-running retrievers. It could be that my approach is not the best one from the architecture point of view; happy to have a discussion about that. For me this was the best place to introduce the change and start retesting the different implementations.
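For concreteness, here is a minimal, dependency-free sketch of the reordering idea (the function name is mine, not the PR's): given documents sorted most-relevant-first, interleave them so the strongest hits sit at the edges of the context and the weakest ones land in the middle.

```python
from typing import List


def reorder_lost_in_the_middle(docs: List[str]) -> List[str]:
    """Reorder docs (given most-relevant-first) so the strongest hits
    end up at the beginning and end of the list, and the weakest ones
    in the middle of the context."""
    reordered: List[str] = []
    # Walk from least to most relevant, alternating between prepending
    # and appending; the least relevant docs accumulate in the middle.
    for i, doc in enumerate(reversed(docs)):
        if i % 2 == 1:
            reordered.append(doc)
        else:
            reordered.insert(0, doc)
    return reordered


# reorder_lost_in_the_middle(["d1", "d2", "d3", "d4", "d5"])  # d1 most relevant
# -> ["d1", "d3", "d5", "d4", "d2"]: d1 and d2 at the edges, d5 in the middle
```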

@rlancemartin, @eyurtsev

GMartin-dev · Jul 11 '23 06:07

I think it probably makes sense to implement this as a DocumentTransformer rather than putting it on the base VectorStore. Happy to help refactor if you'd like.

baskaryan · Jul 11 '23 20:07

Yeap, I started with that approach first, then I realized it would imply rewriting a ton of code. Basically, every place you have a single retriever you would have to re-implement it as a Compressor, apply the new DocumentTransformer / filter, etc. I was thinking that we need a more generic on/off switch for this, since it's a pretty basic / low-level tweak that affects all the different vector stores no matter what. If we go with a document transformer, how do we implement it in a way that is less intrusive and avoids forcing a cascade of refactors everywhere?
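To make the concern concrete, here is roughly what every call site would have to do under the transformer approach, using the existing ContextualCompressionRetriever machinery (a sketch; `reordering` and `base_retriever` are placeholders):

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline

# Wrap the new transformer in a compressor pipeline...
pipeline = DocumentCompressorPipeline(transformers=[reordering])

# ...then wrap the existing retriever so the pipeline runs after retrieval.
retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=base_retriever,  # e.g. vectorstore.as_retriever()
)
```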

GMartin-dev · Jul 11 '23 21:07

@baskaryan Added it as a document transformer. I'm still not 100% sure this is the best approach. I mean, it works, but it will force using compressors everywhere you need a retriever.

GMartin-dev · Jul 12 '23 21:07

Ya, we just added a new directory to make the addition of new document_transformers more obvious, which I see you used. Will have a review tomorrow!

rlancemartin · Jul 14 '23 05:07

I just refactored it! xD

GMartin-dev · Jul 14 '23 05:07

Also, in general: IIUC, this is a re-ordering of documents after retrieval (similar post-processing to something like Cohere re-rank). Re-rank is a vectorstore wrapper:

https://python.langchain.com/docs/modules/data_connection/retrievers/integrations/cohere-reranker

I'll have a look at the notebook tomorrow AM to get a better sense for usage.
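For comparison, the re-rank pattern on that page looks roughly like this (a sketch; `vectorstore` is a placeholder, and CohereRerank needs a Cohere API key):

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Re-rank is query-aware: it rescores the retrieved docs against the
# query, so it sits in the compressor slot of the wrapper.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=CohereRerank(),
    base_retriever=vectorstore.as_retriever(),
)
docs = compression_retriever.get_relevant_documents("my query")
```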

rlancemartin · Jul 14 '23 05:07

Yeap, this is my second implementation already; it's hard to find the right place for this reordering. It's currently a DocumentTransformer, though the difference between transformers and compressors is fuzzy to me for this case... But since we do not need the query to run this reordering, it's totally query-agnostic. Will a DocumentTransformer do?

Off topic (but related): for a new simple embeddings reordering in a scenario with a merger retriever, check this comment: https://github.com/hwchase17/langchain/issues/3991#issuecomment-1609923453 In that case I think we need a DocumentCompressor, since we need the query as input.

But in the case of the lost-in-the-middle reordering, we might need to apply it broadly before formatting the prompt (say, for any context with more than 10 documents), since it's related to how the model processes the prompt. Right now it seems all state-of-the-art models have this issue, but some new architecture in the future might not need it.
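The interface difference is the crux (a sketch; `reordering` and `embeddings_filter` stand in for a transformer and a compressor):

```python
# A DocumentTransformer is query-agnostic: it only sees the documents.
reordered_docs = reordering.transform_documents(docs)

# A DocumentCompressor is query-aware: the query is a required input.
filtered_docs = embeddings_filter.compress_documents(docs, query="my question")
```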

GMartin-dev · Jul 14 '23 06:07

Ya, great point on DocumentCompressor vs DocumentTransformer. Compressors all have an explicit compress_documents method. In this case, you are right: no compression is happening, so a DocumentTransformer is a better fit. But it's also a DocumentTransformer applied after retrieval. That's fine; the overall flow looks like this:

[flow diagram]
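In code, that flow is roughly (a sketch; the names are placeholders):

```python
docs = retriever.get_relevant_documents(query)        # 1. retrieve
docs = reordering.transform_documents(docs)           # 2. reorder (query not needed)
context = "\n\n".join(d.page_content for d in docs)   # 3. format into the prompt
```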

rlancemartin · Jul 17 '23 22:07

Unrelated to this PR, but on the same theme:

Do we have redundancy between EmbeddingsFilter in document_compressors and EmbeddingsRedundantFilter in document_transformers?

The second one went in with your earlier PR, but at the time I had not closely examined the various options in document_compressors.

rlancemartin · Jul 17 '23 23:07

Not sure I'm following here; EmbeddingsFilter and EmbeddingsRedundantFilter existed before, right? The previous PR added EmbeddingsClusteringFilter. The clustering is a mix of filtering / grouping / re-ordering all together. Not sure it really fits in any of the old ones; different parameters are needed, etc.
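To spell out the (non-)overlap, roughly how the two pre-existing filters differ (a sketch; `embeddings` is any Embeddings implementation):

```python
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import EmbeddingsFilter

# Transformer: drops near-duplicate documents; no query involved.
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)

# Compressor: keeps only documents similar to the *query*.
relevance_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)

# EmbeddingsClusteringFilter (from the earlier PR) is different again:
# it clusters the docs and keeps those closest to each cluster center,
# mixing filtering, grouping and re-ordering.
```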

GMartin-dev · Jul 18 '23 02:07

Ok, good. I haven't looked closely at the implementation in compressors that existed previously, like you said. If there is no functional overlap, then no concern from me.

rlancemartin · Jul 18 '23 03:07

Overall, this is great. Just add a notebook example of how to use this stand-alone with a vectorstore and k > 5 to demonstrate larger-scale retrieval. And also run `poetry run black .` for lint.
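A rough sketch of what that notebook example might look like (assuming the transformer ends up named LongContextReorder; the class name isn't stated in this thread, and `texts` / `query` are placeholders):

```python
from langchain.document_transformers import LongContextReorder  # assumed name
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(texts, embeddings)

# Larger-scale retrieval: pull well more than the usual handful of docs.
docs = vectorstore.similarity_search(query, k=10)

# Stand-alone reordering: no retriever wrapper needed.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)
```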

rlancemartin · Jul 18 '23 05:07

Yeap, the lint stuff is weird; I have black set as the default formatter in VS Code... but it's not catching all the same formatting issues, it seems.

GMartin-dev · Jul 18 '23 05:07

Looks great! Needed `poetry run ruff . --fix` to fix an error (did it for you). `make format` should run this.

This is good to go in once tests pass.

rlancemartin · Jul 18 '23 14:07