Support for new embedding models
Today there are new open-source embedding models that outperform the classic sentence-transformers models and, in some cases, are superior to the OpenAI ones.
Haystack users have asked us to support these new models (#4051 and #4946).
It would be good to explore what we need to do to support them in Haystack.
Side note: supporting the INSTRUCTOR family of models will probably require several changes because they tailor the embeddings to the task using an instruction prompt; supporting the e5 models should be easier...
@sjrl already found a way to make e5 work. One thing we could improve on, though, is that e5 requires documents to be prefixed with passage: and queries with query:. It would be great if that could be added somehow.
Yes, I found that we can load e5 in Haystack, but we cannot easily add the prefixes that Mathis mentioned. Even without the prefixes it already works quite well, but we are probably losing some performance by not using them.
@sjrl, thanks for the clarification. Could you post a code example of using e5 embeddings in Haystack?
Here is a minimal example of how to load the embedding retriever. You can then use it as you normally would use an EmbeddingRetriever.
NOTE: Make sure to use the cosine similarity function for these embeddings in the document store.
from haystack.nodes import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore

doc_store = InMemoryDocumentStore(
    similarity="cosine",  # the e5 models were trained with a cosine similarity function
    embedding_dim=768,
)

e5 = EmbeddingRetriever(
    document_store=doc_store,
    embedding_model="intfloat/e5-base-v2",
    model_format="transformers",  # make sure to specify the transformers model format
    pooling_strategy="reduce_mean",  # the pooling method used to train the e5 models
    top_k=20,
    max_seq_len=512,
)

doc_store.update_embeddings(e5)
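As for the prefixes discussed above, Haystack does not add them for you, so a simple (if slightly hacky) workaround is to add them to the text yourself. Here is a short sketch on top of the snippet above; the documents and query are made up:

from haystack import Document

# Prefix documents with "passage: " before indexing.
docs = [
    Document(content="passage: Berlin is the capital of Germany."),
    Document(content="passage: Paris is the capital of France."),
]

# Write the prefixed documents and compute their embeddings with the e5 retriever.
doc_store.write_documents(docs)
doc_store.update_embeddings(e5)

# Prefix the query with "query: " at retrieval time.
results = e5.retrieve(query="query: What is the capital of Germany?", top_k=1)
print(results[0].content)

Note that with this workaround the prefix becomes part of the stored content, so you may want to strip it when displaying results.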
@julian-risch How can we use EmbeddingRetriever to finetune E5 models via retriever.train(), similar to this tutorial? The docstring reads "We only support the training of sentence-transformer embedding models." Does that mean we cannot finetune an E5 model using the EmbeddingRetriever class?
In addition, is there a utility script to create a dataset in this format (a rough sketch of such a dataset follows below)?
* question: the question string
* pos_doc: the positive document string
* neg_doc: the negative document string
* score: the score margin
thanks.
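As far as I know there is no ready-made utility script for this; below is a rough sketch (not an official tool) of what such a dataset looks like when assembled by hand as a list of dicts, with made-up rows:

training_data = [
    {
        "question": "What is the capital of Germany?",
        "pos_doc": "Berlin is the capital and largest city of Germany.",
        "neg_doc": "Munich is the capital of the German state of Bavaria.",
        "score": 0.9,  # score margin between positive and negative document, e.g. from a cross-encoder
    },
    {
        "question": "Who wrote Faust?",
        "pos_doc": "Faust is a tragic play written by Johann Wolfgang von Goethe.",
        "neg_doc": "Friedrich Schiller wrote the play William Tell.",
        "score": 0.8,
    },
]

# retriever.train(training_data=training_data)  # per the docstring quoted above,
# this is currently only supported for sentence-transformers models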
I would like to add support for INSTRUCTOR Embedding Models. I have opened a PR (#5836) that adds INSTRUCTOR to Haystack (v2).
The implementation is very similar to the one for the Sentence Transformers Embedding Models (https://github.com/deepset-ai/haystack/issues/5567).
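For illustration, here is how INSTRUCTOR models are typically used via the upstream InstructorEmbedding package (pip install InstructorEmbedding). This is only a sketch of how the instruction prompt tailors the embedding to a task, not the API added by the PR:

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-base")

# Each input is an [instruction, text] pair; the instruction describes the task
# and domain, so the same text gets different embeddings for different tasks.
doc_embeddings = model.encode(
    [["Represent the Wikipedia document for retrieval:",
      "Berlin is the capital of Germany."]]
)
query_embeddings = model.encode(
    [["Represent the Wikipedia question for retrieving supporting documents:",
      "What is the capital of Germany?"]]
)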