[Model]: `Qwen/Qwen3-Embedding-0.6B-GGUF`

curiousily opened this issue 7 months ago

Which model would you like to support?

Hi,

Would it be possible to add support for the model? Link to the model: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF

Thank you!

What are the main advantages of this model?

The Qwen3 embedding models have achieved state-of-the-art performance across a wide range of downstream evaluations: the 8B variant ranks No. 1 on the MTEB multilingual leaderboard (as of May 26, 2025, score 70.58), and the companion reranking models excel in various text retrieval scenarios.

curiousily · Jun 05 '25 12:06

Just an FYI: grab the safetensors version for now. The tokenizer was updated, and it doesn't look like the official GGUFs have been updated to match.

electroglyph · Jun 07 '25 10:06
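For reference, here is a minimal sketch of using the safetensors release through sentence-transformers while the GGUFs lag behind. The repo name and the query prompt follow the Qwen3-Embedding-0.6B model card, but treat the exact arguments as assumptions:

from sentence_transformers import SentenceTransformer

# Load the safetensors release of the 0.6B model (not the stale GGUF export)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["What is the capital of China?"]
documents = ["The capital of China is Beijing.", "Gravity bends light around massive objects."]

# Qwen3 embedding models use an instruction-style prompt on the query side only
query_emb = model.encode(queries, prompt_name="query")
doc_emb = model.encode(documents)

print(model.similarity(query_emb, doc_emb))  # cosine similarity matrix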

Here's my uint8-quantized version with uint8 output; it's compatible with the current version of fastembed:

https://huggingface.co/electroglyph/Qwen3-Embedding-0.6B-onnx-uint8

The one retrieval benchmark I ran didn't produce the best results, so I'd be interested to see other benchmark results if people are willing to submit them.

electroglyph · Jun 09 '25 01:06
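If anyone wants to run a quick sanity check before a full benchmark, here is a rough numpy sketch for scoring the uint8 embeddings. The cast-and-normalize step is an assumption about how the quantized output should be compared; if the export uses a nonzero zero-point, subtract it before normalizing:

import numpy as np

def cosine_scores(queries: np.ndarray, docs: np.ndarray) -> np.ndarray:
    # Cast the uint8 embeddings to float32 before normalizing; integer
    # arithmetic would overflow and cannot represent unit-length vectors.
    q = queries.astype(np.float32)
    d = docs.astype(np.float32)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    return q @ d.T  # (n_queries, n_docs) cosine similarity matrix

# Usage with embeddings from the model, e.g. arrays of shape (n, 1024):
# scores = cosine_scores(query_embs, doc_embs)
# print(scores.argmax(axis=1))  # best-matching document per query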

import random
import time

from fastembed import TextEmbedding
from fastembed.common.model_description import ModelSource, PoolingType

# 3000 random 20-character strings as benchmark input
texts = [
    "".join(random.choices("abcdefghijklmnopqrstuvwxyz ", k=20))
    for _ in range(3000)
]

# Register the custom ONNX model; pooling and normalization are baked into the graph
TextEmbedding.add_custom_model(
    model="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/Qwen3-Embedding-0.6B-onnx-uint8"),
    dim=1024,
    model_file="dynamic_uint8.onnx",
)
model = TextEmbedding("electroglyph/Qwen3-Embedding-0.6B-onnx-uint8")

start = time.time()
embeddings = list(model.embed(texts, batch_size=256, parallel=14))
elapsed = time.time() - start

print(f"Encoded {len(texts)} strings in {elapsed:.2f}s ({len(texts)/elapsed:.0f} strings/sec)")

I got `Encoded 3000 strings in 34.23s (88 strings/sec)`, which is not really fast. Is there a way to bump up encoding performance?

michelkluger · Nov 26 '25 06:11

Love the benchmark technique =)

These causal embedding models are kinda slow compared to their BERT cousins; I'm not sure there's any way around it besides using the CUDA backend.

electroglyph · Nov 26 '25 06:11
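For anyone who wants to try the CUDA route, a sketch assuming the fastembed-gpu package (which bundles onnxruntime-gpu) is installed; the providers argument is how fastembed selects an ONNX Runtime execution provider, and the model registration from the earlier snippet is assumed to have run first:

# pip install fastembed-gpu
from fastembed import TextEmbedding

# Assumes TextEmbedding.add_custom_model(...) from the snippet above was called
model = TextEmbedding(
    "electroglyph/Qwen3-Embedding-0.6B-onnx-uint8",
    providers=["CUDAExecutionProvider"],
)

texts = ["some example sentence"] * 3000
# parallel= is omitted here since it spawns CPU worker processes
embeddings = list(model.embed(texts, batch_size=256))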