[Question] On an M2 Max, FastEmbed is way slower than SentenceTransformers
I'm comparing embedding performance between FastEmbed and SentenceTransformers, and in my experiment FastEmbed turns out to be way slower. See a complete working example at my GitHub repo. Even if I disable the GPU for SentenceTransformers (utils.py --> _pick_device --> make it return "cpu" only), it is still way faster than FastEmbed.
Can someone take a look at it? Maybe I'm doing something wrong.
Tested on an M2 Max, 32 GB.
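For reference, the comparison is roughly of this shape (a minimal sketch, not the repo's actual code; the model name and corpus size are placeholders, and I'm assuming sentence-transformers/all-MiniLM-L6-v2 is available in both libraries):

```python
import time

from fastembed import TextEmbedding
from sentence_transformers import SentenceTransformer

docs = ["A short benchmark sentence about embeddings."] * 2000

# SentenceTransformers, forced onto the CPU for a like-for-like comparison
st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")
t0 = time.perf_counter()
st_model.encode(docs, batch_size=32)
print(f"sentence-transformers (cpu): {time.perf_counter() - t0:.2f}s")

# FastEmbed (ONNX Runtime, CPU execution provider by default)
fe_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
t0 = time.perf_counter()
list(fe_model.embed(docs, batch_size=32))
print(f"fastembed (cpu): {time.perf_counter() - t0:.2f}s")
```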
I am experiencing the same issue (#539) on my system: SentenceTransformers without GPU is still a lot faster than FastEmbed.
It looks like the team is busy with something more important... We will continue using Sentence Transformers in the meantime. It's sad though because I'm preparing a post on my blog and I don't know if I did everything correctly to compare the two... I also asked Qdrant's CTO, Andre Zayarni, on June 21st if someone could help. Evidently, they can't...
@saadtahlx @gsantopaolo I explored this issue and ran both sentence-transformers and fastembed on my local macOS (M2 chip) machine. As reported, there are significant performance differences between the two.
I analyzed multiple touchpoints—model loading time, tokenization, model execution, and pre/post-processing steps. The only noticeable difference in execution time was during the actual embedding creation phase. I also reviewed the ONNX runtime configuration within the _load_onnx_model method and found everything functioning as expected, with no clear room for further optimization there.
On deeper investigation, I discovered that PyTorch is well-optimized for macOS and the Arm64 architecture. It leverages the MPS GPU backend instead of the CPU, which provides a considerable speed advantage over ONNX Runtime's CPU-only configuration on Mac machines.
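For context, this is roughly how SentenceTransformers ends up on the GPU on Apple Silicon (a minimal sketch; the model name is just an example):

```python
import torch
from sentence_transformers import SentenceTransformer

# Recent PyTorch builds expose Apple's Metal backend as the "mps" device
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)
embeddings = model.encode(["hello world"])
print(device, embeddings.shape)
```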
Here are a few suggestions that might help improve performance:
- Increase batch_size: since the dataset is large, increasing the batch_size parameter in model.embed can significantly reduce execution time. For instance, SparseTextEmbedding uses a default batch size of 256, whereas SentenceTransformers defaults to 32. Tuning this parameter could bridge part of the performance gap.
- Tune the threads parameter: experimenting with the threads parameter might also yield performance gains, especially during parallel embedding generation (see the sketch after this list).
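A minimal sketch of both knobs, assuming fastembed's TextEmbedding API (the model name and the values 8 and 256 are arbitrary examples):

```python
from fastembed import TextEmbedding

docs = ["Some text to embed."] * 10_000

# threads is handed down to the ONNX Runtime session options;
# batch_size controls how many documents are embedded per model run
model = TextEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    threads=8,
)
embeddings = list(model.embed(docs, batch_size=256))
```

The best values depend on the model and the document lengths, so it's worth sweeping both on the M2 Max rather than trusting the defaults.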
@hemantgarg26 isn't the ONNX model able to use the GPU? I thought so...
Honestly, Mac is not the issue; the issue is whether the same behavior occurs on Linux/Nvidia or not.
Excellent investigation @hemantgarg26 and thanks for sharing the results!
@gsantopaolo ONNX can use CPU, GPU and Silicon/Mac (see this); unfortunately, each needs a separate build. At this moment, FastEmbed does not have a Mac/Silicon build.
"unfortunately, each needs a separate build. At this moment, FastEmbed does not have a Mac/Silicon build."
Is this still true as of today?
"⚠️ The official ONNX Runtime now includes arm64 binaries for MacOS as well with Core ML support. Please use the official wheel package as this repository is no longer needed." Ref: https://github.com/cansik/onnxruntime-silicon
It does seem that recent ONNX Runtime builds for macOS include CoreML support.
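You can check which providers the installed wheel actually exposes:

```python
import onnxruntime as ort

# On an arm64 macOS build with Core ML support, this list should contain
# 'CoreMLExecutionProvider' in addition to 'CPUExecutionProvider'
print(ort.get_available_providers())
```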
Looking at the docs, it seems enabling the provider requires additional configuration options.
```python
import onnxruntime as ort

model_path = "model.onnx"
providers = [
    ("CoreMLExecutionProvider", {
        "ModelFormat": "MLProgram",
        "MLComputeUnits": "ALL",
        "RequireStaticInputShapes": "0",
        "EnableOnSubgraphs": "0",
    }),
]
session = ort.InferenceSession(model_path, providers=providers)
# input_feed maps the model's input names to numpy arrays
outputs = session.run(None, input_feed)
```
https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider.html
In this case, it seems all that's necessary is to specify a provider when initializing FastEmbed.
https://github.com/qdrant/fastembed/blob/b718cc6a88c847a74fe3437239158efc99d98f97/README.md?plain=1#L226
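Something along these lines might work (a sketch only, assuming fastembed forwards the providers argument, including (name, options) tuples, straight to onnxruntime.InferenceSession; I haven't verified it on a Mac):

```python
from fastembed import TextEmbedding

model = TextEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    providers=[
        ("CoreMLExecutionProvider", {"ModelFormat": "MLProgram", "MLComputeUnits": "ALL"}),
        "CPUExecutionProvider",  # fallback if CoreML can't take the graph
    ],
)
embeddings = list(model.embed(["hello world"]))
```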
Did anybody try this? Could a documented example perhaps be provided?
I might/might not find time for this. :)