
Unexpected output from server.cpp `/embedding` endpoint

Open · k8si opened this issue on May 2, 2024 · 0 comments

What is the issue?

The embeddings produced by a model running in llamafile seem to be substantially different from those produced by llama.cpp.

llama.cpp embeddings are very close (~0.99 cosine similarity) to those produced by the same model via HuggingFace (which I'm treating as the 'reference embeddings'). On the other hand, llamafile embeddings only get ~0.6 cosine similarity to the HuggingFace embeddings. I tested this across multiple llamafile versions (see results below).

Tested with:

  • llamafile versions v0.7.1 through v0.8.1
  • llama.cpp commit: 6ecf3189
  • MacBook Pro with Apple M2 Pro (32 GB)
  • macOS 14.2.1
  • Only tested with one model: all-MiniLM-L6-v2 (BERT architecture)

How to replicate the issue

I put all the scripts/information to replicate this issue in this repo: https://github.com/k8si/replicate-llamafile-embeddings-issue

The short version:

To inspect the differences between embeddings produced by different backends, I embed the text "Alice has had it with computers." with the same(-ish) model running in HF, llama.cpp, and llamafile:

  1. HuggingFace - used the sentence-transformers/all-MiniLM-L6-v2 PyTorch weights directly
  2. llamafile - used full F32 GGUF version of the model from leliuga/all-MiniLM-L6-v2-GGUF
  3. llama.cpp - used full F32 GGUF version of the model from leliuga/all-MiniLM-L6-v2-GGUF

I use the F32 GGUF to rule out any quantization effects and stay as close to the HuggingFace reference model as possible.
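For reference, here is a minimal sketch of how the embeddings are produced (a simplified version of the scripts in the repo linked above, not a verbatim copy). It assumes the llamafile server is already running locally with embeddings enabled and listening on the default port 8080; adjust the URL and startup flags for your setup.

```python
import requests
from sentence_transformers import SentenceTransformer

TEXT = "Alice has had it with computers."

# 1. HuggingFace reference embedding, using the PyTorch weights directly.
hf_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb_hf = hf_model.encode(TEXT)

# 2. llamafile embedding, via the server.cpp /embedding endpoint.
#    Assumes the server was started with the F32 GGUF and embeddings
#    enabled, and is listening on the default host/port.
resp = requests.post("http://localhost:8080/embedding", json={"content": TEXT})
resp.raise_for_status()
emb_llamafile = resp.json()["embedding"]

# 3. llama.cpp embedding (emb_llamacpp): obtained analogously, e.g. from
#    llama.cpp's embedding example or server build at commit 6ecf3189.
```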

Then I compute the cosine similarity between the HF and llamafile embeddings, and compare it to the cosine similarity between the HF and llama.cpp embeddings. I would expect the two scores to be (nearly) identical, but they are not, as the results below show.
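The comparison itself boils down to a few lines. A minimal numpy sketch, assuming `emb_hf`, `emb_llamafile`, and `emb_llamacpp` are the 1-D float vectors obtained above:

```python
import numpy as np

def cosine_sim(a, b):
    # Standard cosine similarity: dot product of the two vectors
    # divided by the product of their L2 norms.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"cosine-sim(emb_hf, emb_llamafile) = {cosine_sim(emb_hf, emb_llamafile):.6f}")
print(f"cosine-sim(emb_hf, emb_llamacpp) = {cosine_sim(emb_hf, emb_llamacpp):.6f}")
```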

Results

Results across the last 6 llamafile releases (v0.7.1 to v0.8.1):

$ cat results/results-* | grep -A 2 "RESULTS"

RESULTS (llamafile v0.7.1):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.2):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.3):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.4):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.8):
cosine-sim(emb_hf, emb_llamafile) = 0.605049
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.8.1):
cosine-sim(emb_hf, emb_llamafile) = 0.605049
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--

The test cannot be run prior to v0.7.1, since BERT architectures were not supported before that release and all-MiniLM-L6-v2 is a BERT model.
