Support for new embedding models

Open anakin87 opened this issue 2 years ago • 4 comments

Today there are new open-source embedding models that outperform sentence-transformers models and, in some cases, are superior to OpenAI's.

Haystack users have repeatedly asked us to support these new models (#4051 and #4946).

It would be good to explore what we need to do to support these new models in Haystack.

Side note: supporting the INSTRUCTOR family of models will probably require several changes, because they tailor the embeddings to the task using an instruction prompt; supporting e5 models should be easier...
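For context, here is a minimal sketch (not Haystack code) of why INSTRUCTOR models need extra handling: each input is an [instruction, text] pair rather than a plain string, so the instruction would have to be passed through the embedding interface somehow. It uses the standalone InstructorEmbedding package; the checkpoint and instructions are only examples.

from InstructorEmbedding import INSTRUCTOR

# Example checkpoint; instructor-large and instructor-xl also exist.
model = INSTRUCTOR("hkunlp/instructor-base")

# Each input is an [instruction, text] pair; the instruction tailors the embedding to the task.
doc_embeddings = model.encode([
    ["Represent the document for retrieval:", "Haystack is an open-source framework for building search systems."],
])
query_embeddings = model.encode([
    ["Represent the question for retrieving supporting documents:", "What is Haystack?"],
])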

anakin87 avatar Jun 30 '23 14:06 anakin87

@sjrl already found a way to make e5 work. One thing we could improve, though, is that e5 expects documents to be prefixed with passage: and queries with query:. It would be great if that could be added somehow.

mathislucka avatar Jul 03 '23 07:07 mathislucka

Yes, I found that we can load e5 in Haystack, but we cannot easily add the prefixes that Mathis mentioned. Even without the prefixes it already works quite well, but we are probably losing some performance by not using them.

sjrl avatar Jul 03 '23 07:07 sjrl

@sjrl, thanks for the clarification. Could you post a code example of using e5 embeddings in Haystack?

anakin87 avatar Jul 03 '23 08:07 anakin87

Here is a minimal example of how to load the embedding retriever. You can then use it as you normally would any other EmbeddingRetriever.

NOTE: Make sure to use the cosine similarity function for these embeddings in the document store.

from haystack.nodes import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore

doc_store = InMemoryDocumentStore(
    similarity="cosine",  # the e5 models were trained with a cosine similarity function
    embedding_dim=768
)

e5 = EmbeddingRetriever(
    document_store=doc_store,
    embedding_model="intfloat/e5-base-v2",
    model_format="transformers",  # Make sure we specify the transformers model format
    pooling_strategy="reduce_mean",  # This is the pooling method used to train the e5 models
    top_k=20,
    max_seq_len=512,
)
doc_store.update_embeddings(e5)
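
A possible manual workaround for the passage:/query: prefixes mentioned above, continuing the example (a sketch only, not a built-in feature; the documents and question are placeholders):

from haystack import Document

# Prefix passages with "passage: " before indexing, as recommended for the e5 models.
doc_store.write_documents([
    Document(content="passage: Haystack is an open-source framework for building search systems."),
    Document(content="passage: e5 is a family of text embedding models."),
])
doc_store.update_embeddings(e5)

# Prefix the query with "query: " at retrieval time.
results = e5.retrieve(query="query: What is e5?", top_k=5)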

sjrl avatar Jul 03 '23 08:07 sjrl

@julian-risch how can we use EmbeddingRetriever to fine-tune E5 models via retriever.train(), similar to this tutorial? The docstring reads "We only support the training of sentence-transformers embedding models." Does that mean we cannot fine-tune an E5 model using the EmbeddingRetriever class?

In addition, is there a utility script to create a dataset in this format?

* question: the question string
* pos_doc: the positive document string
* neg_doc: the negative document string
* score: the score margin

thanks.
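
For reference, retriever.train() takes training_data as a list of dicts with exactly those keys; below is a minimal sketch with hypothetical values, assuming retriever is an EmbeddingRetriever like the e5 one above (whether this works for E5 rather than sentence-transformers models is exactly the open question here):

# Hypothetical training data in the format described by the docstring.
training_data = [
    {
        "question": "What is Haystack?",
        "pos_doc": "Haystack is an open-source framework for building search systems.",
        "neg_doc": "The weather in Berlin is mild in spring.",
        "score": 0.9,  # score margin; only needed for losses that use it
    },
]

retriever.train(training_data=training_data, n_epochs=1, batch_size=4)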

rnyak avatar Aug 30 '23 22:08 rnyak

I would like to add support for INSTRUCTOR Embedding Models. I have opened a PR (#5836) that adds INSTRUCTOR to Haystack (v2).

The implementation is very similar to the implementation for the Sentence Transformers Embedding Models (https://github.com/deepset-ai/haystack/issues/5567).

awinml avatar Sep 19 '23 08:09 awinml