Improve `_check_docstore_similarity_function` function in `_BaseEmbeddingEncoder`
**Is your feature request related to a problem? Please describe.**
It can be very difficult for users to know which similarity function (cosine or dot_product) is appropriate for a given embedding model.
This is why we have `_check_docstore_similarity_function`: it alerts users when the similarity function specified in the DocumentStore does not match the one expected by the Retriever model.
The current implementation of `_check_docstore_similarity_function` is here: https://github.com/deepset-ai/haystack/blob/94f660c56f1dcf643a56f9555008d6e65e4995f9/haystack/nodes/retriever/_embedding_encoder.py#L93-L119
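For readers without the link handy, here is a simplified, illustrative sketch of this style of name-based check. It is not the actual Haystack code, and the heuristics and function name are assumptions for illustration only:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def infer_similarity_from_name(model_name: str, docstore_similarity: str) -> Optional[str]:
    """Illustrative name-based check (NOT the real Haystack implementation).

    Returns the similarity function hinted at by the model name, if any,
    and warns when it disagrees with the DocumentStore setting.
    """
    if "sentence-transformers" not in model_name:
        return None  # name-based detection only triggers on this substring
    # Some model names embed the metric, e.g. "...-cos-v1" or "...-dot-v1",
    # but most (like "all-mpnet-base-v2") do not.
    if "-cos" in model_name:
        expected = "cosine"
    elif "-dot" in model_name:
        expected = "dot_product"
    else:
        return None  # no hint in the name -> nothing we can check
    if expected != docstore_similarity:
        logger.warning(
            "Model '%s' seems to expect '%s' similarity, but the "
            "DocumentStore is configured with '%s'.",
            model_name, expected, docstore_similarity,
        )
    return expected
```

Both weaknesses addressed by this proposal are visible in the sketch: detection keys on the literal substring `sentence-transformers`, and the metric is read from the model name, which most models do not encode.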
**Describe the solution you'd like**
I would like to suggest a few improvements to the function:
- Use the same model detection for sentence-transformers as we do in `EmbeddingRetriever._infer_model_format` (https://github.com/deepset-ai/haystack/blob/94f660c56f1dcf643a56f9555008d6e65e4995f9/haystack/nodes/retriever/dense.py#L1823-L1836), so we rely on the correct configuration file being present rather than on the model name containing `sentence-transformers`.
- Most of the sentence-transformers models do not include the name of the appropriate similarity function in their model name, which is how we currently detect the similarity function for the model. For example, two models we commonly use, `all-mpnet-base-v2` and `paraphrase-multilingual-mpnet-base-v2`, don't follow this naming convention. To solve this, I would like to suggest using the information provided in this table by the sentence-transformers library, which records the suitable scoring functions for all of the pre-trained models provided by the library.
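A sketch of how the two suggestions could be combined, assuming `huggingface_hub.list_repo_files` for the repo inspection and a hand-maintained lookup table. The file names checked and the table entries are assumptions and would need to be verified against the sentence-transformers documentation:

```python
from typing import Optional, Tuple

# Illustrative excerpt of a lookup table built from the pretrained-models
# overview published by the sentence-transformers library. Entries here are
# examples only and must be verified against that table before use.
SUITABLE_SCORE_FUNCTIONS = {
    "sentence-transformers/multi-qa-mpnet-base-dot-v1": ("dot_product",),
    "sentence-transformers/all-mpnet-base-v2": ("cosine", "dot_product"),  # normalized embeddings
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2": ("cosine",),
}


def is_sentence_transformers_model(model_name: str) -> bool:
    """Detect a Sentence Transformers model by the files present in its
    HF Hub repo (e.g. 'modules.json') instead of by its name."""
    try:
        from huggingface_hub import list_repo_files  # optional dependency
        files = set(list_repo_files(model_name))
    except Exception:
        return False  # offline, missing dependency, or unknown repo
    return bool({"modules.json", "config_sentence_transformers.json"} & files)


def suitable_similarities(model_name: str) -> Optional[Tuple[str, ...]]:
    """Look up the scoring functions recommended for a known model,
    or None if the model is not in the table."""
    return SUITABLE_SCORE_FUNCTIONS.get(model_name)
```

With a table like this, the check could warn whenever the DocumentStore's similarity is not among the recommended functions for the model, instead of guessing from the model name.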
**Describe alternatives you've considered**
For point 2, I tried to figure out whether the configuration files present on the HuggingFace Hub for sentence-transformers models (e.g. https://huggingface.co/sentence-transformers/all-mpnet-base-v2/tree/main) contain information about the suitable scoring functions, but I could not find anything.
**Additional context**
Getting the scoring function right is very important for performance. We find that models that do not normalize their embeddings and are trained on one specific scoring function can drop up to 20% in recall metrics when the wrong scoring function is used in the document store.
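The sensitivity to the scoring function can be illustrated with plain vectors: when embeddings are not normalized, dot product and cosine similarity can rank the same documents in opposite orders. A minimal, self-contained example:

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))


query = [1.0, 1.0]
doc_aligned = [0.9, 1.0]  # points in almost the same direction, small norm
doc_long = [4.0, 0.5]     # different direction, but a much larger norm

# Dot product rewards the large-norm document...
assert dot(query, doc_long) > dot(query, doc_aligned)
# ...while cosine similarity ranks the aligned document first.
assert cosine(query, doc_aligned) > cosine(query, doc_long)
```

For models that L2-normalize their embeddings, the two metrics produce identical rankings, which is why the recall drop is only a concern for models that skip normalization.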
This proposal makes sense.
About point 2, in the past I too struggled to find this information. The link you provided is a step forward. Unfortunately, it covers 39 models, while there are 124 Sentence Transformers models on the HF Hub.
In my opinion, we should open an issue somewhere on HuggingFace to highlight the importance of the score function. It should be indicated in the model card or somewhere else easy to find.
@sjrl WDYT?
@sjrl Going through the backlog, we came across this older issue. Is it still as relevant? What about you or @anakin87 creating an issue in the HuggingFace transformers repo?
Hey @julian-risch, yeah, I think this is still really relevant. For example, I needed to go to the paper and GitHub repo of the recent E5 models from Microsoft to find out which similarity function should be used for them (cosine, in this case).