Unexpected output from server.cpp `/embedding` endpoint
What is the issue?
The embeddings produced by a model running in llamafile seem to be substantially different from those produced by llama.cpp.
llama.cpp embeddings are very close (~0.99 cosine similarity) to those produced by the same model via HuggingFace (which I'm treating as the 'reference embeddings'). On the other hand, llamafile embeddings only get ~0.6 cosine similarity to the HuggingFace embeddings. I tested this across multiple llamafile versions (see results below).
Tested with:
- llamafile versions v0.7.1 through v0.8.1
- llama.cpp commit 6ecf3189
- MacBook Pro with Apple M2 Pro (32 GB)
- macOS 14.2.1
- Only tested with one model: all-MiniLM-L6-v2 (BERT architecture)
How to replicate the issue
I put all the scripts/information to replicate this issue in this repo: https://github.com/k8si/replicate-llamafile-embeddings-issue
The short version:
To inspect the differences between embeddings produced by different backends, I embed the text "Alice has had it with computers." with the same(-ish) model running in HF, llama.cpp, and llamafile:
- HuggingFace - used sentence-transformers/all-MiniLM-L6-v2 pytorch weights directly
- llamafile - used full F32 GGUF version of the model from leliuga/all-MiniLM-L6-v2-GGUF
- llama.cpp - used full F32 GGUF version of the model from leliuga/all-MiniLM-L6-v2-GGUF
I use the F32 GGUF to remove any quantization effects and stay as close to the HuggingFace reference model as possible.
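
A minimal sketch of how the three embeddings can be obtained (the ports, and the `{"content": ...}` request/response shape, are assumptions based on llama.cpp's server.cpp conventions; the actual scripts are in the repo linked above):

```python
import requests
from sentence_transformers import SentenceTransformer

TEXT = "Alice has had it with computers."

# Reference embedding from the original PyTorch weights via sentence-transformers.
hf_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb_hf = hf_model.encode(TEXT)

def server_embedding(base_url: str, text: str) -> list[float]:
    """POST to the server.cpp /embedding endpoint and return the embedding vector.

    Assumes the llama.cpp-style request/response shape:
    request {"content": "<text>"} -> response {"embedding": [...]}
    """
    resp = requests.post(f"{base_url}/embedding", json={"content": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

# Ports are placeholders; both servers are running the F32 GGUF from
# leliuga/all-MiniLM-L6-v2-GGUF with embeddings enabled.
emb_llamafile = server_embedding("http://localhost:8080", TEXT)
emb_llamacpp = server_embedding("http://localhost:8081", TEXT)
```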
Then I compute the cosine similarity between the HF embedding and the llamafile embedding, and compare it to the cosine similarity between the HF embedding and the llama.cpp embedding. I would expect the two scores to be essentially identical, but they are not, as the results below show.
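
The comparison itself is plain cosine similarity; a sketch (variable names follow the snippet above, which is mine, not the repo's):

```python
import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"cosine-sim(emb_hf, emb_llamafile) = {cosine_sim(emb_hf, emb_llamafile):.6f}")
print(f"cosine-sim(emb_hf, emb_llamacpp)  = {cosine_sim(emb_hf, emb_llamacpp):.6f}")
```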
Results
Results across the last 6 llamafile releases (v0.7.1 to v0.8.1):
$ cat results/results-* | grep -A 2 "RESULTS"
RESULTS (llamafile v0.7.1):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.2):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.3):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.7.4):
cosine-sim(emb_hf, emb_llamafile) = 0.635290
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.8):
cosine-sim(emb_hf, emb_llamafile) = 0.605049
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
RESULTS (llamafile v0.8.1):
cosine-sim(emb_hf, emb_llamafile) = 0.605049
cosine-sim(emb_hf, emb_llamacpp) = 0.999999
--
The test cannot be run prior to v0.7.1, since BERT models were not supported before that release and all-MiniLM-L6-v2 uses the BERT architecture.