Utility helpers to train and use Custom Embeddings
Problem
The default embeddings (e.g. OpenAI's Ada-002) are great generalists. However, they are not tailored to your specific use case.
Proposed Solution
Customizing Embeddings!
See my tutorial / lessons learned if you're interested in learning more, step-by-step, with screenshots and tips.
How it works
Training
```mermaid
flowchart LR
subgraph "Basic Text Embeddings"
Input[Input Text]
OpenAI[OpenAI Embedding API]
Embed[Original Embedding]
end
subgraph "Train Custom Embedding Matrix"
Input-->OpenAI
OpenAI-->Embed
Raw1["Original Embedding #1"]
Raw2["Original Embedding #2"]
Raw3["Original Embedding #3"]
Embed-->Raw1 & Raw2 & Raw3
Score1_2["Similarity Label for (#1, #2) => Similar (1)"]
Raw1 & Raw2-->Score1_2
Score2_3["Similarity Label for (#2, #3) => Dissimilar (-1)"]
Raw2 & Raw3-->Score2_3
Dataset["Similarity Training Dataset\n[First, Second, Label]\n[1, 2, 1]\n[2, 3, -1]\n..."]
Raw1 & Raw2 & Raw3 -->Dataset
Score1_2-->|1|Dataset
Score2_3 -->|-1|Dataset
Train["Train Custom Embedding Matrix"]
Dataset-->Train
Train-->CustomMatrix
CustomMatrix["Custom Embedding Matrix"]
end
```
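To make the training step concrete, here is a minimal sketch (not the proposed LangChain API), roughly following the OpenAI cookbook approach: it assumes PyTorch, pre-computed embedding pairs, and ±1 similarity labels, and learns a matrix so that the cosine similarity of the transformed embeddings matches the labels. The function name and hyperparameter defaults are illustrative only.

```python
import torch
import torch.nn.functional as F

def train_matrix(e1, e2, labels, modified_dim=1536, epochs=30,
                 lr=100.0, dropout=0.2, batch_size=100):
    """Learn a matrix M such that cosine_sim(e1 @ M, e2 @ M) approaches the ±1 labels."""
    e1 = torch.tensor(e1, dtype=torch.float32)
    e2 = torch.tensor(e2, dtype=torch.float32)
    labels = torch.tensor(labels, dtype=torch.float32)  # 1 = similar, -1 = dissimilar
    # Start near the identity so the customized embeddings begin close to the originals.
    matrix = torch.nn.Parameter(
        torch.eye(e1.shape[1], modified_dim) + 0.01 * torch.randn(e1.shape[1], modified_dim)
    )
    optimizer = torch.optim.Adam([matrix], lr=lr)
    for _ in range(epochs):
        for i in range(0, len(labels), batch_size):
            a = F.dropout(e1[i:i + batch_size], p=dropout) @ matrix
            b = F.dropout(e2[i:i + batch_size], p=dropout) @ matrix
            loss = F.mse_loss(F.cosine_similarity(a, b), labels[i:i + batch_size])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return matrix.detach().numpy()  # the "Custom Embedding Matrix" in the diagram above
```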
Embedding
```mermaid
flowchart LR
subgraph "Similarity Search"
direction LR
CustomMatrix["Custom Embedding Matrix\n(e.g. custom-embedding.npy)"]
Multiply["(Original Embedding) x (Matrix)"]
CustomMatrix --> Multiply
Text1["Original Texts #1, #2, #3..."]
RawAll["Original Embeddings #1, #2, #3, ..."]
Custom1["Custom Embeddings #1, #2, #3, ..."]
Text1-->RawAll
RawAll --> Multiply
Multiply --> Custom1
DB["Vector Database"]
Custom1 -->|Upsert| DB
Search["Search Query"]
EmbedSearch["Original Embedding for Search Query"]
CustomEmbedSearch["Custom Embedding for Search Query"]
Search-->EmbedSearch
EmbedSearch-->Multiply
Multiply-->CustomEmbedSearch
SimilarFound["Similar Embeddings Found"]
CustomEmbedSearch -->|Search| DB
DB-->|Search Results|SimilarFound
end
```
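At upsert and query time, the custom embedding is simply the original embedding multiplied by the trained matrix. A rough sketch of that step follows (`embed_custom` is a hypothetical helper, not an existing LangChain API; the file name comes from the diagram above):

```python
import numpy as np
from langchain.embeddings import OpenAIEmbeddings

matrix = np.load("custom-embedding.npy")  # trained custom embedding matrix
embeddings = OpenAIEmbeddings()

def embed_custom(texts):
    original = np.array(embeddings.embed_documents(texts))  # original embeddings
    return original @ matrix                                # custom embeddings

doc_vectors = embed_custom(["Document #1", "Document #2"])  # upsert these into the vector DB
query_vector = np.array(embeddings.embed_query("my search query")) @ matrix  # then search with this
```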
Example
```python
from langchain.embeddings import OpenAIEmbeddings, CustomizeEmbeddings
### Generalized Embeddings
embeddings = OpenAIEmbeddings()
text = "This is a test document."
query_result1 = embeddings.embed_query(text)
doc_result1 = embeddings.embed_documents([text])
### Training Customized Embeddings
# Data Preparation
# TODO: How to improve this developer experience using Langchain? Need pairs of Documents with a desired similarity score/label.
data = [
{
# Pre-computed embedding vectors
"vector_1": [0.1, 0.2, -0.3, ...],
"vector_2": [0.1, 0.2, -0.3, ...],
"similar": 1, # Or -1
},
{
# Original texts that need to be embedded lazily
"text_1": "Some text...",
"text_2": "Other text...",
"similar": 1, # Or -1
},
]
# Training
options = {
"modified_embedding_length": 1536,
"test_fraction": 0.5,
"random_seed": 123,
"max_epochs": 30,
"dropout_fraction": 0.2,
"progress": True,
"batch_size": [10, 100, 1000],
"learning_rate": [10, 100, 1000],
}
customEmbeddings = CustomizeEmbeddings(embeddings) # Pass `embeddings` for computing any embeddings lazily
customEmbeddings.train(data, options) # Stores results in training_results and best_result
all_results = customEmbeddings.training_results
best_result = customEmbeddings.best_result
# best_result = { "accuracy": 0.98, "matrix": [...], "options": {...} }
# Usage
custom_query_result1 = customEmbeddings.embed_query(text)
custom_doc_result1 = customEmbeddings.embed_documents([text])
# Saving
customEmbeddings.save("custom-embedding.npy") # Saves the best
### Loading Customized Embeddings
customEmbeddings2 = CustomizeEmbeddings(embeddings)
customEmbeddings2.load("custom-embedding.npy")
# Usage
custom_query_result2 = customEmbeddings2.embed_query(text)
custom_doc_result2 = customEmbeddings2.embed_documents([text])
```
Parameters
.train options
| Param | Type | Description | Default Value |
|---|---|---|---|
| `random_seed` | `int` | Arbitrary random seed; helpful for reproducibility | `123` |
| `modified_embedding_length` | `int` | Dimension of the output custom embedding | `1536` |
| `test_fraction` | `float` | Fraction of the data held out as the test set | `0.5` |
| `max_epochs` | `int` | Total number of passes over all of the training data | `10` |
| `dropout_fraction` | `float` | Probability of an element being zeroed | `0.2` |
| `batch_size` | `List[int]` | How many samples per batch to load | `[10, 100, 1000]` |
| `learning_rate` | `List[int]` | Works best when similar in magnitude to `batch_size` | `[10, 100, 1000]` |
| `progress` | `boolean` | Whether to show progress in logs | `True` |
Recommended Reading
- My tutorial / lessons learned
- Customizing embeddings from OpenAI Cookbook
- @pullerz's blog post on lessons learned
P.S. I'd love to personally contribute this to the Langchain repo and community! Please let me know if you think it is a valuable idea and any feedback on the proposed solution. Thank you!
Relevant: https://github.com/HKUNLP/instructor-embedding
This is currently available for use in LangChain via the Hugging Face Instructor embeddings. It's likely not as good as fine-tuning, but it's an easy alternative for getting better results with minimal extra effort.
Great point! https://langchain.readthedocs.io/en/latest/reference/modules/embeddings.html?highlight=instructor#langchain.embeddings.HuggingFaceInstructEmbeddings
I would love to compare. Hoping LangChain can be the common layer for developing and comparing these different models (quick Instructor example below):
- Basic Embeddings (any embedding model)
- Instructor Embeddings (only HuggingFace Instructor model)
- Custom matrix (any embedding model)
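For reference, here is roughly how the Instructor embeddings can be tried in LangChain today (assuming the `InstructorEmbedding` and `sentence_transformers` packages are installed; the model name and instructions below are just the defaults):

```python
from langchain.embeddings import HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    embed_instruction="Represent the document for retrieval: ",
    query_instruction="Represent the question for retrieving supporting documents: ",
)
query_result = embeddings.embed_query("This is a test document.")
doc_result = embeddings.embed_documents(["This is a test document."])
```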
Thanks @Glavin001. That would be great! I've recently been thinking about a notebook to compare all the LLMs available in LangChain as well. It would be very nice to have comparison notebooks for them, as well as for embeddings, similar to how refine, map-reduce, etc. are shown.
Hi, @Glavin001! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, this issue proposes the addition of utility helpers to train and use custom embeddings in the LangChain repository. There has been some discussion in the comments about using the HuggingFace Instructor model as an alternative to fine-tuning, and comparing different models and embeddings. However, the issue remains unresolved.
Could you please let us know if this issue is still relevant to the latest version of the LangChain repository? If it is, please comment on the issue to let us know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project!