
[Question] on an M2 Max FastEmbed is way slower than SentenceTransformers

Open gsantopaolo opened this issue 6 months ago • 5 comments

I'm comparing embedding performance between FastEmbed and SentenceTransformers, and in my experiment FastEmbed turns out to be much slower than SentenceTransformers. See a complete working example in my GitHub repo. Even if I disable the GPU for SentenceTransformers (utils.py --> _pick_device --> make it return "cpu" only), it is still much faster than FastEmbed.
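Roughly, the comparison boils down to something like this (a simplified sketch, not the exact code from the repo; the model name and corpus are placeholders):

import time

from fastembed import TextEmbedding
from sentence_transformers import SentenceTransformer

docs = ["some example sentence about embeddings"] * 1000

# sentence-transformers, forced onto the CPU
st_model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cpu")
t0 = time.perf_counter()
st_model.encode(docs, batch_size=32)
print(f"sentence-transformers (cpu): {time.perf_counter() - t0:.2f}s")

# fastembed (ONNX Runtime, CPU)
fe_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
t0 = time.perf_counter()
list(fe_model.embed(docs, batch_size=32))  # embed() is lazy, so force evaluation
print(f"fastembed: {time.perf_counter() - t0:.2f}s")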

Can someone take a look at it? Maybe I'm doing something wrong.

Tested on an M2 Max with 32 GB of RAM.

gsantopaolo avatar Jun 19 '25 07:06 gsantopaolo

I am experiencing the same issue #539 on my system. Sentence Transformers without a GPU is still a lot faster than FastEmbed.

saadtahlx avatar Jul 12 '25 10:07 saadtahlx

It looks like the team is busy with something more important... We will continue using Sentence Transformers in the meantime. It's sad though because I'm preparing a post on my blog and I don't know if I did everything correctly to compare the two... I also asked Qdrant's CTO, Andre Zayarni, on June 21st if someone could help. Evidently, they can't...

gsantopaolo avatar Jul 12 '25 16:07 gsantopaolo

@saadtahlx @gsantopaolo I explored this issue and ran both sentence-transformers and fastembed on my local macOS (M2 chip) machine. As reported, there are significant performance differences between the two.

I analyzed multiple touchpoints—model loading time, tokenization, model execution, and pre/post-processing steps. The only noticeable difference in execution time was during the actual embedding creation phase. I also reviewed the ONNX runtime configuration within the _load_onnx_model method and found everything functioning as expected, with no clear room for further optimization there.

On deeper investigation, I discovered that PyTorch is well-optimized for macOS and the Arm64 architecture. It leverages the MPS GPU backend instead of the CPU, which provides a considerable speed advantage over ONNX Runtime's CPU-only configuration on Mac machines.
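A quick way to confirm this on a given machine is to check which device sentence-transformers actually picks (a small sketch; the model name is just an example):

import torch
from sentence_transformers import SentenceTransformer

# True on Apple Silicon builds of PyTorch with MPS support
print(torch.backends.mps.is_available())

# Recent sentence-transformers versions select MPS automatically when available
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
print(model.device)  # typically mps on an M-series Mac, cpu otherwise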

Here are a few suggestions that might help improve performance:

Increase batch_size: Since the dataset is large, increasing the batch_size parameter in model.embed can significantly reduce execution time. For instance, SparseTextEmbedding uses a default batch size of 256, whereas SentenceTransformers defaults to 32. Tuning this parameter could bridge some part of the performance gap.

Tune the threads parameter: Experimenting with the threads parameter might also yield performance gains, especially during parallel embedding generation (see the sketch below).
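For illustration, both knobs are plain arguments in the FastEmbed API (a rough sketch; the model name is just an example, and the exact effect of threads depends on how it is forwarded to ONNX Runtime):

from fastembed import TextEmbedding

# threads is handed down to the ONNX Runtime session (CPU thread count)
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", threads=8)

docs = ["some example sentence"] * 10_000

# larger batches reduce per-call overhead; embed() returns a generator
embeddings = list(model.embed(docs, batch_size=256))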

hemantgarg26 avatar Jul 22 '25 19:07 hemantgarg26

@hemantgarg26 isn't the ONNX model able to use the GPU? I thought so...

Honestly, Mac is not the issue; the issue is whether the same behavior occurs on Linux/Nvidia or not.

gsantopaolo avatar Jul 23 '25 18:07 gsantopaolo

Excellent investigation @hemantgarg26 and thanks for sharing the results!

@gsantopaolo — ONNX Runtime can use CPU, GPU, and Apple Silicon (see this); unfortunately, each needs a separate build. At this moment, FastEmbed does not have a Mac/Silicon build.
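As a quick check, the installed onnxruntime wheel reports which execution providers it was built with:

import onnxruntime as ort

# e.g. ['CPUExecutionProvider'] on the default wheel,
# or ['CoreMLExecutionProvider', 'CPUExecutionProvider'] if CoreML support is included
print(ort.get_available_providers())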

NirantK avatar Aug 14 '25 14:08 NirantK

unfortunately, each needs a separate build. At this moment, FastEmbed does not have a Mac/Silicon build.

Is this still true as of today?

"⚠️ The official ONNX Runtime now includes arm64 binaries for MacOS as well with Core ML support. Please use the official wheel package as this repository is no longer needed." Ref: https://github.com/cansik/onnxruntime-silicon

It does seem that recent ONNX Runtime wheels for macOS include CoreML support.

Looking at the docs, it seems enabling the provider requires additional configuration options.

import onnxruntime as ort

model_path = "model.onnx"

# Request the CoreML execution provider with its documented options
providers = [
    ("CoreMLExecutionProvider", {
        "ModelFormat": "MLProgram",
        "MLComputeUnits": "ALL",
        "RequireStaticInputShapes": "0",
        "EnableOnSubgraphs": "0",
    }),
]

session = ort.InferenceSession(model_path, providers=providers)
# input_feed maps the model's input names to numpy arrays
outputs = session.run(None, input_feed)

https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider.html

In this case, it seems all that's necessary is to specify a provider when initializing FastEmbed. https://github.com/qdrant/fastembed/blob/b718cc6a88c847a74fe3437239158efc99d98f97/README.md?plain=1#L226
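Something along these lines, an untested sketch assuming the providers argument is passed straight through to the ONNX Runtime session (the model name is just an example):

from fastembed import TextEmbedding

# Untested: ask for CoreML first, fall back to CPU for anything it can't run
model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=[
        ("CoreMLExecutionProvider", {"ModelFormat": "MLProgram", "MLComputeUnits": "ALL"}),
        "CPUExecutionProvider",
    ],
)

embeddings = list(model.embed(["hello world"]))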

Did anybody try this? Could a documented example be provided perhaps?

I might/might not find time for this. :)

dokterbob avatar Dec 05 '25 09:12 dokterbob