Utility helpers to train and use Custom Embeddings
Problem
The default embeddings (e.g. OpenAI's Ada-002) are great generalists. However, they are not tailored to your specific use case.
Proposed Solution
Customizing Embeddings!
See my tutorial / lessons learned if you're interested in learning more, step-by-step, with screenshots and tips.
How it works
Training
```mermaid
flowchart LR
subgraph "Basic Text Embeddings"
Input[Input Text]
OpenAI[OpenAI Embedding API]
Embed[Original Embedding]
end
subgraph "Train Custom Embedding Matrix"
Input-->OpenAI
OpenAI-->Embed
Raw1["Original Embedding #1"]
Raw2["Original Embedding #2"]
Raw3["Original Embedding #3"]
Embed-->Raw1 & Raw2 & Raw3
Score1_2["Similarity Label for (#1, #2) => Similar (1)"]
Raw1 & Raw2-->Score1_2
Score2_3["Similarity Label for (#2, #3) => Dissimilar (-1)"]
Raw2 & Raw3-->Score2_3
Dataset["Similarity Training Dataset\n[First, Second, Label]\n[1, 2, 1]\n[2, 3, -1]\n..."]
Raw1 & Raw2 & Raw3 -->Dataset
Score1_2-->|1|Dataset
Score2_3 -->|-1|Dataset
Train["Train Custom Embedding Matrix"]
Dataset-->Train
Train-->CustomMatrix
CustomMatrix["Custom Embedding Matrix"]
end
```
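To make the training step concrete, here is a minimal sketch (not the proposed LangChain API), roughly following the OpenAI cookbook approach: it assumes PyTorch, pre-computed embedding pairs, and ±1 similarity labels, and learns a matrix so that the cosine similarity of the transformed embeddings matches the labels. The function name and hyperparameter defaults are illustrative only.

```python
import torch
import torch.nn.functional as F

def train_matrix(e1, e2, labels, modified_dim=1536, epochs=30,
                 lr=100.0, dropout=0.2, batch_size=100):
    """Learn a matrix M such that cosine_sim(e1 @ M, e2 @ M) approaches the ±1 labels."""
    e1 = torch.tensor(e1, dtype=torch.float32)
    e2 = torch.tensor(e2, dtype=torch.float32)
    labels = torch.tensor(labels, dtype=torch.float32)  # 1 = similar, -1 = dissimilar
    # Start near the identity so the customized embeddings begin close to the originals.
    matrix = torch.nn.Parameter(
        torch.eye(e1.shape[1], modified_dim) + 0.01 * torch.randn(e1.shape[1], modified_dim)
    )
    optimizer = torch.optim.Adam([matrix], lr=lr)
    for _ in range(epochs):
        for i in range(0, len(labels), batch_size):
            a = F.dropout(e1[i:i + batch_size], p=dropout) @ matrix
            b = F.dropout(e2[i:i + batch_size], p=dropout) @ matrix
            loss = F.mse_loss(F.cosine_similarity(a, b), labels[i:i + batch_size])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return matrix.detach().numpy()  # the "Custom Embedding Matrix" in the diagram above
```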
Embedding
```mermaid
flowchart LR
subgraph "Similarity Search"
direction LR
CustomMatrix["Custom Embedding Matrix\n(e.g. custom-embedding.npy)"]
Multiply["(Original Embedding) x (Matrix)"]
CustomMatrix --> Multiply
Text1["Original Texts #1, #2, #3..."]
RawAll["Original Embeddings #1, #2, #3, ..."]
Custom1["Custom Embeddings #1, #2, #3, ..."]
Text1-->RawAll
RawAll --> Multiply
Multiply --> Custom1
DB["Vector Database"]
Custom1 -->|Upsert| DB
Search["Search Query"]
EmbedSearch["Original Embedding for Search Query"]
CustomEmbedSearch["Custom Embedding for Search Query"]
Search-->EmbedSearch
EmbedSearch-->Multiply
Multiply-->CustomEmbedSearch
SimilarFound["Similar Embeddings Found"]
CustomEmbedSearch -->|Search| DB
DB-->|Search Results|SimilarFound
end
```
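At upsert and query time, the custom embedding is simply the original embedding multiplied by the trained matrix. A rough sketch of that step follows (`embed_custom` is a hypothetical helper, not an existing LangChain API; the file name comes from the diagram above):

```python
import numpy as np
from langchain.embeddings import OpenAIEmbeddings

matrix = np.load("custom-embedding.npy")  # trained custom embedding matrix
embeddings = OpenAIEmbeddings()

def embed_custom(texts):
    original = np.array(embeddings.embed_documents(texts))  # original embeddings
    return original @ matrix                                # custom embeddings

doc_vectors = embed_custom(["Document #1", "Document #2"])  # upsert these into the vector DB
query_vector = np.array(embeddings.embed_query("my search query")) @ matrix  # then search with this
```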
Example
```python
from langchain.embeddings import OpenAIEmbeddings, CustomizeEmbeddings
### Generalized Embeddings
embeddings = OpenAIEmbeddings()
text = "This is a test document."
query_result1 = embeddings.embed_query(text)
doc_result1 = embeddings.embed_documents([text])
### Training Customized Embeddings
# Data Preparation
# TODO: How to improve this developer experience using Langchain? Need pairs of Documents with a desired similarity score/label.
data = [
{
# Pre-computed embedding vectors
"vector_1": [0.1, 0.2, -0.3, ...],
"vector_2": [0.1, 0.2, -0.3, ...],
"similar": 1, # Or -1
},
{
# Original texts that need to be embedded lazily
"text_1": "Some text...",
"text_2": "Other text...",
"similar": 1, # Or -1
},
]
# Training
options = {
"modified_embedding_length": 1536,
"test_fraction": 0.5,
"random_seed": 123,
"max_epochs": 30,
"dropout_fraction": 0.2,
"progress": True,
"batch_size": [10, 100, 1000],
"learning_rate": [10, 100, 1000],
}
customEmbeddings = CustomizeEmbeddings(embeddings) # Pass `embeddings` for computing any embeddings lazily
customEmbeddings.train(data, options) # Stores results in training_results and best_result
all_results = customEmbeddings.training_results
best_result = customEmbeddings.best_result
# best_result = { "accuracy": 0.98, "matrix": [...], "options": {...} }
# Usage
custom_query_result1 = customEmbeddings.embed_query(text)
custom_doc_result1 = customEmbeddings.embed_documents([text])
# Saving
customEmbeddings.save("custom-embedding.npy") # Saves the best
### Loading Customized Embeddings
customEmbeddings2 = CustomizeEmbeddings(embeddings)
customEmbeddings2.load("custom-embedding.npy")
# Usage
custom_query_result2 = customEmbeddings2.embed_query(text)
custom_doc_result2 = customEmbeddings2.embed_documents([text])
```
Parameters
.train options
| Param | Type | Description | Default Value |
|---|---|---|---|
| `random_seed` | `int` | Arbitrary random seed; helpful for reproducibility | `123` |
| `modified_embedding_length` | `int` | Dimension of the output custom embedding | `1536` |
| `test_fraction` | `float` | Fraction of the data held out as the test set | `0.5` |
| `max_epochs` | `int` | Total number of passes over all of the training data | `10` |
| `dropout_fraction` | `float` | Probability of an element being zeroed | `0.2` |
| `batch_size` | `List[int]` | How many samples per batch to load | `[10, 100, 1000]` |
| `learning_rate` | `List[int]` | Works best when similar in magnitude to `batch_size` | `[10, 100, 1000]` |
| `progress` | `boolean` | Whether to show progress in logs | `True` |
Recommended Reading
- My tutorial / lessons learned
- Customizing embeddings from OpenAI Cookbook
- @pullerz's blog post on lessons learned
P.S. I'd love to personally contribute this to the Langchain repo and community! Please let me know if you think it is a valuable idea and any feedback on the proposed solution. Thank you!
Relevant: https://github.com/HKUNLP/instructor-embedding
This is currently available for use in LangChain via the Hugging Face Instructor embeddings. It's likely not as good as fine-tuning, but it's an easy alternative for getting better results with minimal extra effort.
Great point! https://langchain.readthedocs.io/en/latest/reference/modules/embeddings.html?highlight=instructor#langchain.embeddings.HuggingFaceInstructEmbeddings
I would love to compare. Hoping LangChain can be the common layer for developing and comparing these different models (quick Instructor example below):
- Basic Embeddings (any embedding model)
- Instructor Embeddings (only HuggingFace Instructor model)
- Custom matrix (any embedding model)
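For reference, here is roughly how the Instructor embeddings can be tried in LangChain today (assuming the `InstructorEmbedding` and `sentence_transformers` packages are installed; the model name and instructions below are just the defaults):

```python
from langchain.embeddings import HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    embed_instruction="Represent the document for retrieval: ",
    query_instruction="Represent the question for retrieving supporting documents: ",
)
query_result = embeddings.embed_query("This is a test document.")
doc_result = embeddings.embed_documents(["This is a test document."])
```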
Thanks @Glavin001. That would be great! I've recently been thinking about a notebook to compare all the LLMs available in LangChain as well. It would be very nice to have comparison notebooks for them, as well as for embeddings, similar to how refine, map-reduce, etc. are shown.
Hi, @Glavin001! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, this issue proposes the addition of utility helpers to train and use custom embeddings in the LangChain repository. There has been some discussion in the comments about using the HuggingFace Instructor model as an alternative to fine-tuning, and comparing different models and embeddings. However, the issue remains unresolved.
Could you please let us know if this issue is still relevant to the latest version of the LangChain repository? If it is, please comment on the issue to let us know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project!