langchain
Does langchain support using Contriever as an embedding method?
It is great to see that LangChain already supports HyDE. However, in the original paper, once the hypothetical documents are generated, the embedding is computed with the Contriever model, as described in the HyDE official repo (https://github.com/texttron/hyde). Can I ask how I should enable Contriever instead of OpenAI embeddings? Thank you.
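One possible route (a sketch, not an official LangChain integration): wrap Contriever in a class that exposes LangChain's Embeddings interface (embed_documents / embed_query), so it can be passed anywhere OpenAIEmbeddings is expected, including HypotheticalDocumentEmbedder. The class name and the mean-pooling details below are assumptions on my part:

```python
# Hypothetical wrapper exposing LangChain's Embeddings interface
# (embed_documents / embed_query) backed by facebook/contriever.
# This is a sketch, not an official LangChain class.
from typing import List


class ContrieverEmbeddings:
    def __init__(self, model_name: str = "facebook/contriever"):
        # Imports are deferred so the class can be defined without
        # transformers installed; loading requires network access.
        from transformers import AutoModel, AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def _encode(self, texts: List[str]) -> List[List[float]]:
        import torch
        inputs = self.tokenizer(
            texts, padding=True, truncation=True, return_tensors="pt"
        )
        with torch.no_grad():
            token_emb = self.model(**inputs)[0]
        # Mean pooling over non-padding tokens, as on the Contriever model card.
        mask = inputs["attention_mask"]
        token_emb = token_emb.masked_fill(~mask[..., None].bool(), 0.0)
        pooled = token_emb.sum(dim=1) / mask.sum(dim=1)[..., None]
        return pooled.tolist()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self._encode(texts)

    def embed_query(self, text: str) -> List[float]:
        return self._encode([text])[0]
```

An instance of this could then be passed as the base embeddings to HypotheticalDocumentEmbedder in place of OpenAIEmbeddings, assuming that class accepts any object with this interface.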
I looked into this briefly and actually struggled to load it from Hugging Face. E.g., I ran

from transformers import AutoTokenizer, Contriever
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = Contriever.from_pretrained("facebook/contriever")

as per https://huggingface.co/facebook/contriever, but it didn't work (transformers does not ship a Contriever class). Does this work for you?
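A workaround sketch: since facebook/contriever is a plain BERT checkpoint, it can be loaded with AutoModel from transformers and pooled manually (mean pooling over non-padding token embeddings, as shown on the model card), rather than via a Contriever class. The pooling helper below is my own; the model-loading lines are commented out because they require network access:

```python
# Assumption: facebook/contriever can be loaded as a plain BERT encoder
# via AutoModel, with mean pooling applied manually.
import torch


def mean_pooling(token_embeddings, mask):
    # Zero out padding positions, then average over the sequence dimension.
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]


# Loading the checkpoint (requires network access to Hugging Face):
# from transformers import AutoTokenizer, AutoModel
# tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
# model = AutoModel.from_pretrained("facebook/contriever")
# inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# outputs = model(**inputs)
# embeddings = mean_pooling(outputs[0], inputs["attention_mask"])

# Sanity check on dummy data: batch of 2, seq len 3, hidden size 4.
tokens = torch.ones(2, 3, 4)
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
pooled = mean_pooling(tokens, mask)  # averages of all-ones vectors -> all ones
```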
Yeah, it works for me; I can run the demo without any problem. Note that Contriever is imported from the official repo's src.contriever module, not from transformers:
from src.contriever import Contriever
from transformers import AutoTokenizer
contriever = Contriever.from_pretrained("facebook/contriever")
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")  # load the associated tokenizer
sentences = [
"Where was Marie Curie born?",
"Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
"Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings = contriever(**inputs)
score01 = embeddings[0] @ embeddings[1] #1.0473
score02 = embeddings[0] @ embeddings[2] #1.0095
print(score01, score02)
# tensor(1.0473, grad_fn=<DotBackward0>) tensor(1.0095, grad_fn=<DotBackward0>)
My Python version is 3.10.9, PyTorch version is 1.13.1 (CUDA 11.7), and transformers version is 4.25.1. I can also run the demo on my M1 Mac without problems.
Does it actually make any significant difference to use OpenAI embeddings vs. Contriever?
By the way, something I additionally noticed in the LangChain implementation of HyDE: it only considers the hypothetical documents when computing the final vector, and does not include the embedding of the original query/question in the set of vectors that are averaged.
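The suggested tweak can be sketched as follows. combine_with_query is a hypothetical helper for illustration; LangChain's actual combination step averages only the hypothetical-document embeddings:

```python
# Hypothetical helper: average the query's own embedding together with
# the hypothetical-document embeddings, rather than the documents alone.
def combine_with_query(query_emb, doc_embs):
    vectors = [query_emb] + doc_embs
    dim = len(query_emb)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy example: query [1, 0] plus two doc embeddings averages to [2/3, 2/3].
combined = combine_with_query([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```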
@hwchase17 @bryanyzhu How are HyDE embeddings useful? I am going through the doc here https://langchain.readthedocs.io/en/latest/modules/indexes/examples/hyde.html, but I don't find the implementation any different from OpenAI embeddings; the hypothetical embeddings stored in the result variable don't seem to be used in any further processing.
Hi, @bryanyzhu! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you were asking if LangChain supports using Contriever as an embedding method instead of OpenAI embeddings. In the comments, there was a discussion about loading Contriever from Hugging Face and running a demo successfully. There were also questions about the difference between using OpenAI embeddings and Contriever embeddings, as well as the usefulness of HyDE embeddings.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!
Contriever works well with the example above (from the Hugging Face Contriever page), and the issue still occurs with the latest LangChain version. It would be great to add support for it.