
Add embedding API

Open lan2720 opened this issue 2 years ago • 14 comments

For the OpenAI API, they provide both a Completion and an Embedding API, and they state that their text embeddings were trained with contrastive pre-training. LLaMA/Vicuna has now basically replicated OpenAI's completion functionality, so I wonder whether there is also a way to get embeddings? Thank you.

lan2720 avatar Apr 06 '23 08:04 lan2720

@lan2720 We currently do not support APIs, since that would put too much stress on our server.

zhisbug avatar Apr 06 '23 09:04 zhisbug

@zhisbug An API is not necessary, but is it possible to get embeddings from the Vicuna model on my local machine?

lan2720 avatar Apr 06 '23 09:04 lan2720

Embeddings are quite essential for QA tasks that rely on non-parametric knowledge. It would be really nice if the model were able to output vectors for QA.

rayanywhere avatar Apr 06 '23 09:04 rayanywhere

It is not very difficult to make the model output embeddings. Maybe improve this part of the code and expose a FastAPI endpoint that returns them?
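
For illustration, a minimal sketch of what such an endpoint could look like (the route name, request schema, and the assumption that a model/tokenizer pair is already loaded are all illustrative here, not FastChat's actual API):

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# `model` and `tokenizer` are assumed to be loaded elsewhere in the worker process.

class EmbeddingRequest(BaseModel):
    input: str

@app.post("/embeddings")
def embeddings(req: EmbeddingRequest):
    with torch.inference_mode():
        input_ids = tokenizer(req.input, return_tensors="pt").input_ids.to(model.device)
        out = model(input_ids, output_hidden_states=True)
        # Mean-pool the last hidden layer over tokens: [1, num_tokens, dim] -> [dim]
        emb = out.hidden_states[-1][0].mean(dim=0)
    return {"embedding": emb.float().cpu().tolist()}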

Contributions are welcome.

zhisbug avatar Apr 06 '23 10:04 zhisbug

@zhisbug I've tried extracting the last-layer hidden state for each token of the input text and computing the mean of those hidden states as the input text embedding. But embeddings extracted this way don't perform well on a text retrieval task. Any suggestions? My code is as follows:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from fastchat.conversation import conv_templates

def load_model(model_name, device, num_gpus, load_8bit=False):
    if device == "cuda":
        kwargs = {"torch_dtype": torch.float16}
        if load_8bit:
            if num_gpus != "auto" and int(num_gpus) != 1:
                print("8-bit weights are not supported on multiple GPUs. Revert to use one GPU.")
            kwargs.update({"load_in_8bit": True, "device_map": "auto"})
        else:
            if num_gpus == "auto":
                kwargs["device_map"] = "auto"
            else:
                num_gpus = int(num_gpus)
                if num_gpus != 1:
                    kwargs.update({
                        "device_map": "auto",
                        "max_memory": {i: "19GiB" for i in range(num_gpus)},
                    })
    elif device == "cpu":
        kwargs = {}
    else:
        raise ValueError(f"Invalid device: {device}")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True,
        low_cpu_mem_usage=True, **kwargs)

    # calling model.cuda() messes up the weights when loading 8-bit weights
    if device == "cuda" and num_gpus == 1 and not load_8bit:
        model.cuda()

    return model, tokenizer

def get_final_prompt(input_text, tokenizer, prompt_template_name="v1"):
    conv = conv_templates[prompt_template_name].copy()
    # Prompt containing only the role prefix, used to locate where the user text starts.
    conv.messages = []
    conv.append_message(conv.roles[0], "")
    pure_prompt_with_role = conv.get_prompt() + " "
    # Full prompt containing the actual input text.
    conv.messages = []
    conv.append_message(conv.roles[0], input_text)
    final_prompt = conv.get_prompt()
    final_prompt = final_prompt.rstrip(conv.sep)

    # Number of template tokens to skip (without the leading <s>).
    start_token_idx = len(tokenizer.tokenize(pure_prompt_with_role)) - 1
    return start_token_idx, final_prompt


def get_embedding(input_text, model, tokenizer,
                  prompt_template_name="v1",
                  add_pua_first=False):
    start_token_index = 0
    if add_pua_first:
        # Wrap the input in the conversation template and skip the template tokens.
        start_token_index, final_prompt = get_final_prompt(
            input_text, tokenizer, prompt_template_name=prompt_template_name)
    else:
        final_prompt = input_text

    input_ids = tokenizer(final_prompt).input_ids
    out = model(torch.as_tensor([input_ids], device="cuda"), use_cache=True)
    # -1 means the last layer
    last_out_layers = out.hidden_states[-1] # [batch_size, num_tokens, dim]
    # batch_size=1, so get the first case
    last_out_hidden_state = last_out_layers[0]
    # remove the first token <s>
    last_out_hidden_state = last_out_hidden_state[start_token_index+1:]
    # [num_tokens, dim] -> [dim]
    embedding = torch.mean(last_out_hidden_state, 0)
    return embedding.detach().cpu().numpy().tolist()

if __name__ == '__main__':
    model_name = "/data/pretrained/vicuna-13b"
    device = "cuda"
    num_gpus = 1
    load_8bit = False
    model, tokenizer = load_model(model_name, device, num_gpus, load_8bit)
    emb = get_embedding("hello world. this is nice.", model, tokenizer, add_pua_first=True)

lan2720 avatar Apr 07 '23 11:04 lan2720

@lan2720 As mentioned here, a causal LM should use the last token for the embedding. Can you give it a try?

LlamaForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.
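
A minimal sketch of that last-token pooling, under the same setup as the script above (whether it actually beats mean pooling is exactly the open question here):

import torch

@torch.inference_mode()
def get_last_token_embedding(input_text, model, tokenizer):
    # Use the final hidden state of the last token as the sentence embedding,
    # mirroring how LlamaForSequenceClassification pools for classification.
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
    out = model(input_ids, output_hidden_states=True)
    last_hidden = out.hidden_states[-1][0]  # [num_tokens, dim]
    return last_hidden[-1].float().cpu().numpy().tolist()  # last token only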

BIGPPWONG avatar Apr 11 '23 12:04 BIGPPWONG

@lan2720 hi, are you still working on this?

YANG-H avatar Apr 19 '23 17:04 YANG-H

Contributions are welcome. Feel free to submit a PR and ping me for review

zhisbug avatar Apr 21 '23 01:04 zhisbug

Hey, I got curious about this and tried implementing it as well - however, I have no idea what I'm doing :)

Here's what I understood so far:

  1. GPT-2-like models use the last token for classification, and embeddings should also use the last token.
  2. The Llama implementation on HF has two methods:

     - get_input_embeddings() (code: https://github.com/huggingface/transformers/blob/d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c/src/transformers/models/llama/modeling_llama.py#L622), which returns:

       <class 'torch.nn.modules.sparse.Embedding'>
       Embedding(32000, 4096, padding_idx=31999)

     - get_output_embeddings() (code: https://github.com/huggingface/transformers/blob/d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c/src/transformers/models/llama/modeling_llama.py#L628), which returns:

       Linear(in_features=4096, out_features=32000, bias=False)

  3. In GPT-2, we can just tokenize the input and look the tokens up in the embedding map, see for instance: https://github.com/huggingface/transformers/issues/1458
  4. Thus, the equivalent for Llama could maybe be something like this?
@torch.inference_mode()
def get_embeddings(model, tokenizer, prompt, device):
    input_ids = tokenizer(prompt).input_ids
    input_embeddings = model.get_input_embeddings()
    # Look up the input embedding of only the last prompt token.
    result = input_embeddings(torch.LongTensor([input_ids[-1]]))
    return (float(x) for x in result.cpu().detach()[0])

If I run this code I get back a tensor of shape [1, 4096]. I have no idea if this is right, however.

paolorechia avatar Apr 22 '23 23:04 paolorechia

So the approach above yielded nonsense results. I've tried taking the mean over all input token embeddings instead:

@torch.inference_mode()
def get_embeddings(model, tokenizer, prompt):
    input_ids = tokenizer(prompt).input_ids
    input_embeddings = model.get_input_embeddings()
    # [1, num_tokens, dim]
    embeddings = input_embeddings(torch.LongTensor([input_ids]))
    # Mean over tokens -> [dim]
    mean = torch.mean(embeddings[0], 0).cpu().detach()
    return mean

It gives meaningful results, although I didn't think the performance was that good when I tested it with Plato's Laws (link to my test repo: https://github.com/paolorechia/learn-langchain/pull/3/files#diff-878fa05eb59767fe781b8de4a4a7b4efa1b4d6c4080f1381eeea3a9d1d9254d9).

paolorechia avatar Apr 23 '23 13:04 paolorechia

Thank you for the work @paolorechia. Out of curiosity:

  1. Do you get better results from using embeddings from OpenAI API?
  2. Also, how about using other specialized embedding models from Hugging Face that you can run privately, e.g. via https://www.sbert.net/index.html?

Does your work indicate that the Vicuna model just isn't trained/purpose-built for generating good embeddings?

kostecky avatar Apr 24 '23 15:04 kostecky

Hey, thanks for the interest! @kostecky

Those are very good questions; unfortunately, my impression so far is based on gut feeling.

Ideally we would have a benchmark where we could compare the different embeddings in an objective way, right?

I would be happy to test the sentence transformers and compare, but I don’t have a good dataset for benchmarking, would you happen to know one we could use?

Regarding OpenAI, I don't currently have credits to test it, but if we get to the point where Vicuna embeddings perform similarly to or better than sentence transformers, then I'd be willing to spend a few bucks to test it.

paolorechia avatar Apr 24 '23 15:04 paolorechia

As a follow-up, I did a quick test using the Wikipedia source code. I compared the embeddings I extracted from Vicuna to https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, and as suspected, the sentence transformer performed much better:

https://gist.github.com/paolorechia/c9f6aa8316882fc0710d54eb8dfa3f52

While it's not a rigorous benchmark, I think the difference is large enough to show that the extracted embeddings are not good.
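
For anyone who wants to reproduce a comparison along these lines, here is a rough sketch (the sentences are made up, and get_embedding/model/tokenizer refer to the script earlier in this thread):

import torch
from sentence_transformers import SentenceTransformer, util

sentences = ["The cat sat on the mat.",
             "A feline rested on the rug.",
             "Stock prices fell sharply today."]

# Sentence-transformer embeddings.
st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
st_embs = st_model.encode(sentences, convert_to_tensor=True)
print("sbert similarities:\n", util.cos_sim(st_embs, st_embs))

# Vicuna embeddings via the mean-pooled hidden states from the earlier script.
vicuna_embs = torch.tensor([get_embedding(s, model, tokenizer) for s in sentences])
print("vicuna similarities:\n", util.cos_sim(vicuna_embs, vicuna_embs))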

paolorechia avatar Apr 25 '23 20:04 paolorechia

Thanks @paolorechia - I have this sneaking suspicion that some people may think you have to use embeddings from the same model you're using for inference, but as you just demonstrated, that's not the case.

For Q&A and other semantic matching purposes, using a model fine-tuned for embeddings to generate and query them is a better idea.

Are there other use cases for why people would want access to Vicuna embeddings if they don't perform particularly well for classifying and retrieving semantically similar phrases?
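
As an illustration of that split (retrieve with a dedicated embedding model, generate with Vicuna), here is a rough sketch with made-up documents, reusing model/tokenizer from the script above:

import torch
from sentence_transformers import SentenceTransformer, util

docs = ["Vicuna is a chat model fine-tuned from LLaMA.",
        "Plato wrote the Laws late in his life."]
question = "What is Vicuna based on?"

# Retrieval with a purpose-built embedding model.
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_embs = retriever.encode(docs, convert_to_tensor=True)
q_emb = retriever.encode(question, convert_to_tensor=True)
best_doc = docs[int(util.cos_sim(q_emb, doc_embs).argmax())]

# Generation with the chat model.
prompt = f"Context: {best_doc}\nQuestion: {question}\nAnswer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.inference_mode():
    output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))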

kostecky avatar Apr 26 '23 23:04 kostecky

Supported in #663. Closing.

Please try it out and let us know your feedback. A comparison report between Vicuna embeddings and sentence-transformer embeddings would be appreciated.
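
Assuming the new support is exposed through FastChat's OpenAI-compatible server, a request might look roughly like this (host, port, and model name are placeholders; check #663 and the docs for the actual interface):

import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",  # placeholder address
    json={"model": "vicuna-13b", "input": "hello world. this is nice."},
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]  # OpenAI-style response shape
print(len(embedding))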

zhisbug avatar May 07 '23 23:05 zhisbug