EmbeddingRetriever does not account for longer documents
**Describe the bug**
The EmbeddingRetriever does not account for long sequences. More precisely, it passes the documents on to the underlying encoder model (sentence-transformers), which truncates each sequence before embedding it.
As a result, i) the embedding is not based on the full document, and ii) multiple documents can end up with identical embeddings if they share the same starting sequence up to the model's maximum sequence length.
Note: this is also the case for table QA.
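For illustration, here is a minimal, self-contained reproduction of the truncation effect using sentence-transformers directly (outside Haystack; the model name is just an example):

```python
# Two documents that share the same first ~max_seq_length tokens end up with
# identical embeddings, because everything past the limit is silently dropped.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.max_seq_length)  # e.g. 256 word pieces for this model

shared_prefix = "common introduction text " * 200  # far longer than max_seq_length
doc_a = shared_prefix + "This document is about cooking pasta."
doc_b = shared_prefix + "This document is about quantum physics."

emb_a, emb_b = model.encode([doc_a, doc_b])
print(util.cos_sim(emb_a, emb_b))  # ~1.0: the differing tails were truncated away
```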
**Expected behavior**
- Short-term: emit a relevant warning message when documents exceed the model's maximum sequence length.
- Long-term: provide a way to natively handle long documents.
From: @sjrl
I believe, as @ju-gu mentioned, it would be helpful to add a warning message when documents passed to the EmbeddingRetriever are longer than the max_seq_length supported by the loaded embedding model. This would be very similar to the warning thrown by the FARMReader (I think only during training), which warns the user that the texts passed to the reader are being truncated.
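A minimal sketch of what such a warning could look like (the helper and where it would be called from are hypothetical, not existing Haystack code; it assumes a `sentence_transformers.SentenceTransformer` instance):

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_truncated(texts, model):
    """Hypothetical helper: warn when texts exceed the encoder's max_seq_length."""
    max_len = model.max_seq_length
    n_too_long = sum(
        len(model.tokenizer(text, truncation=False)["input_ids"]) > max_len
        for text in texts
    )
    if n_too_long:
        logger.warning(
            "%d of %d documents are longer than the model's max_seq_length "
            "(%d tokens) and will be truncated before embedding.",
            n_too_long, len(texts), max_len,
        )
```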
We also discussed offline in Slack alternatives to truncation. For example, each long document could be split into smaller text chunks, each chunk passed through the EmbeddingRetriever, and the resulting embeddings pooled together (e.g. mean or max); a rough sketch is shown below. This post here explains the concept well but falls short of actually evaluating how well the resulting embedding vectors perform.
This would allow us to create a single embedding that "represents" the whole document. However, as @mathislucka pointed out, this is not how the embedding retriever models were trained, and there do not seem to be good benchmarks for this approach in the NLP community.
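A rough sketch of this chunk-and-pool idea, assuming sentence-transformers and naive whitespace chunking (a real implementation would split on the model's own tokenizer):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def embed_long_document(text: str, chunk_words: int = 150, pooling: str = "mean") -> np.ndarray:
    """Split a long document into word chunks, embed each chunk, and pool the results."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)] or [""]
    chunk_embeddings = model.encode(chunks)  # shape: (n_chunks, embedding_dim)
    if pooling == "mean":
        return chunk_embeddings.mean(axis=0)
    return chunk_embeddings.max(axis=0)
```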
Additional resources:
- Discussion in Sentence Transformers about this topic where they mention that they also use something like mean pooling for larger texts: https://github.com/UKPLab/sentence-transformers/issues/364#issuecomment-706632145
- Longformer: another potential alternative could be to support the Longformer ("Longformer: The Long-Document Transformer"), which was built to embed longer documents. This model type is supported in HuggingFace, and its docs page can be found here: https://huggingface.co/docs/transformers/model_doc/longformer (a minimal usage sketch follows this list).
- BigBird was also developed to handle longer texts. HuggingFace link here: https://huggingface.co/docs/transformers/model_doc/big_bird
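For reference, a minimal sketch of producing a single document embedding from a Longformer checkpoint via HuggingFace Transformers (the mean pooling over token embeddings is a choice made here for illustration, not something prescribed by the model):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")


def longformer_embedding(text: str) -> torch.Tensor:
    # Longformer accepts up to 4096 tokens, far more than the usual 512-token encoders.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return (summed / mask.sum(dim=1)).squeeze(0)
```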
**To Reproduce**
See this Colab Notebook.
**FAQ Check**
- [x] Have you had a look at our new FAQ page?
Hi, any update on this issue?
I can share some intuition about why the idea of pooling won't work well for long documents. First, pooling ignores the order of the chunks/words. This is not a big concern if the chunks are not just one or a few tokens but hundreds of tokens. However, in the linked post, pooling is also used to calculate one embedding for all the words in a chunk. In that case, the more words go into one pool, the more generic the resulting embedding becomes, and this effect is amplified if we later also pool the embeddings of the individual chunks. As an example, imagine a long book with thousands of words. If you calculate the embedding of each word and then average all of them, the resulting document embedding will be very generic and similar to the embeddings of many other books, simply because they share so many words. Stop words will have the biggest effect on the document embedding, and the document embeddings will be hardly usable.
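To make that intuition concrete, here is a toy comparison (illustrative only; the exact numbers depend on the model) between embedding two unrelated sentences directly and averaging their per-word embeddings:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

text_a = "The chef carefully prepared a rich tomato sauce for the fresh pasta."
text_b = "The astronomers measured the faint light of a distant exploding star."

# Sentence-level embeddings keep the two texts clearly distinguishable ...
sent_a, sent_b = model.encode([text_a, text_b])
print("sentence-level similarity:", util.cos_sim(sent_a, sent_b).item())

# ... whereas averaging per-word embeddings (stop words included) tends to pull
# both documents toward a similar, generic vector, raising their similarity.
word_a = model.encode(text_a.split()).mean(axis=0)
word_b = model.encode(text_b.split()).mean(axis=0)
print("averaged word-level similarity:", util.cos_sim(word_a, word_b).item())
```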
Hi, I am working on a use case that involves documents with longer context lengths and have been experimenting with the EmbeddingRetriever for this. The contents are truncated at 512 tokens, and because of this the RAG pipeline is not performing well. Do you have any other workaround, such as processing the documents in batches or using one of the embedding models for longer documents mentioned above?