[Model]: `Qwen/Qwen3-Embedding-0.6B-GGUF`
Which model would you like to support?
Hi,
Would it be possible to add support for this model? Link to the model: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF
Thank you!
What are the main advantages of this model?
The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B-size embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of May 26, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios.
Just an FYI: grab the safetensors version for now, because the tokenizer was updated and it doesn't look like the official GGUFs have been updated to match.
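For anyone who wants to test against the reference weights, here's a minimal sketch of loading the safetensors checkpoint with sentence-transformers. It assumes the non-GGUF repo name is Qwen/Qwen3-Embedding-0.6B and that the model card's prompt_name="query" convention applies for the query-side instruction prefix:

from sentence_transformers import SentenceTransformer

# load the safetensors checkpoint (assumed repo name, not the GGUF repo)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["What is the capital of China?"]
documents = ["The capital of China is Beijing."]

# queries get the instruction prefix defined in the model config; documents don't
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

print(model.similarity(query_embeddings, document_embeddings))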
Here's my uint8-quantized version with uint8 output; it's compatible with the current version of fastembed:
https://huggingface.co/electroglyph/Qwen3-Embedding-0.6B-onnx-uint8
The one retrieval benchmark I ran didn't produce the best results, so I'd be interested to see other benchmark results if people are willing to submit them.
import random
import time

from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

# throwaway benchmark corpus: 3000 random 20-character strings
texts = [
    "".join(random.choices("abcdefghijklmnopqrstuvwxyz ", k=20))
    for _ in range(3000)
]

# register the custom uint8 ONNX export with fastembed
TextEmbedding.add_custom_model(
    model="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8"),
    dim=1024,
    model_file="dynamic_uint8.onnx",
)
model = TextEmbedding(model_name="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8")

start = time.time()
embeddings = list(model.embed(texts, batch_size=256, parallel=14))
elapsed = time.time() - start
print(f"Encoded {len(texts)} strings in {elapsed:.2f}s ({len(texts)/elapsed:.0f} strings/sec)")
I got `Encoded 3000 strings in 34.23s (88 strings/sec)`, which is not really fast. Is there a way to bump up encoding performance?
love the benchmark technique =)
These causal embedding models are kinda slow compared to their BERT cousins; not sure there's really any way around it besides using the CUDA backend (rough sketch below).
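If you want to try that, here's a rough sketch of running the same custom model on the ONNX Runtime CUDA provider. It assumes fastembed-gpu plus a working CUDA/cuDNN setup; also keep in mind the dynamic-uint8 graph is tuned for CPU, so a float ONNX export might be the better fit on GPU:

from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

# register the same custom uint8 model as in the benchmark above
TextEmbedding.add_custom_model(
    model="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8"),
    dim=1024,
    model_file="dynamic_uint8.onnx",
)

# ask ONNX Runtime for the CUDA execution provider (requires fastembed-gpu)
model = TextEmbedding(
    model_name="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8",
    providers=["CUDAExecutionProvider"],
)

# leave parallel= off here: it spawns CPU worker processes, which doesn't help a single GPU
embeddings = list(model.embed(["your text here"], batch_size=256))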