Feature Request: Allow initializing HuggingFaceEmbeddings from cached weights
Motivation
Right now, HuggingFaceEmbeddings doesn't support loading an embedding model's weights from a local cache; it downloads the weights every time. Fixing this would be low-hanging fruit: just allow the user to pass their cache directory.
Suggestion
The only change needed is a few lines in `__init__()`:
```python
from typing import Any, List

from pydantic import BaseModel, Extra

from langchain.embeddings.base import Embeddings

DEFAULT_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"


class HuggingFaceEmbeddings(BaseModel, Embeddings):
    """Wrapper around sentence_transformers embedding models.

    To use, you should have the ``sentence_transformers`` python package installed.

    Example:
        .. code-block:: python

            from langchain.embeddings import HuggingFaceEmbeddings
            model_name = "sentence-transformers/all-mpnet-base-v2"
            hf = HuggingFaceEmbeddings(model_name=model_name)
    """

    client: Any  #: :meta private:
    model_name: str = DEFAULT_MODEL_NAME
    """Model name to use."""

    def __init__(self, cache_folder=None, **kwargs: Any):
        """Initialize the sentence_transformer, optionally from a local cache_folder."""
        super().__init__(**kwargs)
        try:
            import sentence_transformers

            # Proposed change: forward cache_folder so SentenceTransformer can
            # reuse weights already on disk instead of re-downloading them.
            self.client = sentence_transformers.SentenceTransformer(
                model_name_or_path=self.model_name, cache_folder=cache_folder
            )
        except ImportError:
            raise ValueError(
                "Could not import sentence_transformers python package. "
                "Please install it with `pip install sentence_transformers`."
            )

    class Config:
        """Configuration for this pydantic object."""

        extra = Extra.forbid

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Compute doc embeddings using a HuggingFace transformer model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        texts = list(map(lambda x: x.replace("\n", " "), texts))
        embeddings = self.client.encode(texts)
        return embeddings.tolist()

    def embed_query(self, text: str) -> List[float]:
        """Compute query embeddings using a HuggingFace transformer model.

        Args:
            text: The text to embed.

        Returns:
            Embeddings for the text.
        """
        text = text.replace("\n", " ")
        embedding = self.client.encode(text)
        return embedding.tolist()
```
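With that change, offline initialization could look like this (a sketch of the proposed API; the cache path below is just an example):

```python
from langchain.embeddings import HuggingFaceEmbeddings

# cache_folder is the proposed new parameter; point it at a directory that
# already contains the downloaded sentence-transformers model.
hf = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    cache_folder="/path/to/model/cache",  # example path
)
print(hf.embed_query("hello world")[:5])
```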
I am eager to learn how a solution to this problem would be approached. Can you tell me where the weights are located and how they are downloaded? I am a beginner and excited to see the solution, but I will only contribute once I understand the process better, since I have limited experience in machine learning engineering.
Thanks for your quick response. The weights would be downloaded if the user doesn't specify `cache_folder` and initializes `SentenceTransformer()` (from the python package `sentence_transformers`) directly.

By default, if no `cache_folder` is given, `SentenceTransformer` looks for the weights in the directory `SENTENCE_TRANSFORMERS_HOME` (see here); if the weights are not found, it downloads them from the Hugging Face Hub.

So the alternative for users, without changing the LangChain code here, is to set an env variable `SENTENCE_TRANSFORMERS_HOME` that points to the real weight location. Not ideal, but acceptable. In that case we could document the usage in the LangChain `HuggingFaceEmbeddings` docstring, but it transfers the complexity to the user, who has to add the env variable to their python script. To make it user-friendly, we could offer the `cache_folder` option.
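For reference, a minimal sketch of that env-variable workaround (the path is only an example, and the variable has to be set before the model is constructed):

```python
import os

# Must point at the folder where the sentence-transformers weights already live.
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "/path/to/model/cache"  # example path

from langchain.embeddings import HuggingFaceEmbeddings

hf = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
```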
@nicolefinnie Yup, this makes sense. Thanks for the suggestion!
Can we decode the embeddings?
Isn't the dependency on sentence_transformers limiting? I.e., if I wanted to test an OpenAssistant LLM initialized locally from weights, I couldn't use the HuggingFaceEmbeddings class, because sentence_transformers doesn't support OpenAssistant. Am I missing something, or are all HF LLMs (i.e. OpenAssistant, LLaMA, Vicuna, etc.) compatible with sentence_transformers embeddings (both the library and the actual model embeddings)?
Does someone have a working example of initializing HuggingFaceEmbeddings without an internet connection? I have tried specifying the "cache_folder" parameter with the path to a pre-downloaded embedding model from huggingface, but it seems to be ignored.
Hi, just asking again: does anyone have a working example of initializing HuggingFaceEmbeddings without an internet connection?
I need to use this class with a pre-downloaded embedding model instead of downloading from huggingface every time.
I have made it work with this method:
```python
import os

from langchain.embeddings import HuggingFaceEmbeddings

embedding_models_root = "/mnt/embedding_models"
# Path to a locally downloaded sentence-transformers checkpoint.
model_ckpt_path = os.path.join(embedding_models_root, "multi-qa-MiniLM-L6-cos-v1")
embeddings = HuggingFaceEmbeddings(model_name=model_ckpt_path)

text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text, "This is not a test document."])

print("===" * 20)
print("query_result: \n {}".format(query_result))
print("===" * 20)
print("doc_result: \n {}".format(doc_result))
print("===" * 20)
```
Hi, @nicolefinnie! I'm helping the LangChain team manage their backlog and am marking this issue as stale.
It looks like the issue you raised requests adding support for initializing HuggingFaceEmbeddings from cached weights instead of downloading them every time. There have been discussions about potential limitations, working examples, and clarifications on the weight location and download process. One user has even shared a working example of initializing HuggingFaceEmbeddings with pre-downloaded embeddings.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!