Feature Request: Add Support for Qwen3-Reranker Model
Feature request
Description:
I would like to request support for the Qwen3-Reranker model (specifically Qwen3-Reranker-0.6B) in the text-embeddings-inference repository.
Currently, there is an issue when trying to serve a Qwen3-Reranker checkpoint converted from Qwen3ForCausalLM to Qwen3ForSequenceClassification: the server fails at startup with an error indicating that the classifier model type is not supported for Qwen3.
Additional Context:
The Qwen3-Reranker model has been discussed on HuggingFace (reference: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3), but proper integration with the inference server seems to require additional support.
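For context, that discussion describes initializing a single-label classification head from the "yes"/"no" rows of the causal LM head (the reranker expresses relevance as the logit gap between those two tokens at the last position). A minimal sketch of that idea follows; the output path is a placeholder and the clf.score attribute name follows transformers' Qwen3ForSequenceClassification, so treat this as a hedged sketch rather than a verified recipe. Even with a converted checkpoint, TEI still rejects the architecture, which is what this request is about.

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

src = "Qwen/Qwen3-Reranker-0.6B"
dst = "/PATH/Qwen3-Reranker-0.6B-seq-cls"  # placeholder output path

tokenizer = AutoTokenizer.from_pretrained(src)
causal = AutoModelForCausalLM.from_pretrained(src)

# The reranker scores a (query, document) pair via the "yes" vs. "no"
# logit gap at the last position, so a num_labels=1 head can be
# initialized from the difference of those two lm_head rows.
yes_id = tokenizer.convert_tokens_to_ids("yes")
no_id = tokenizer.convert_tokens_to_ids("no")

clf = AutoModelForSequenceClassification.from_pretrained(src, num_labels=1)
with torch.no_grad():
    clf.score.weight.copy_(
        causal.lm_head.weight[yes_id] - causal.lm_head.weight[no_id]
    )

clf.save_pretrained(dst)
tokenizer.save_pretrained(dst)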
Testing was done with the Docker image ghcr.io/huggingface/text-embeddings-inference:turing-1.7.2.
Error traceback:
rerank-qwen3 | 2025-06-17T02:12:36.220459Z  INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
rerank-qwen3 | 2025-06-17T02:12:36.639564Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:463: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))
rerank-qwen3 | 2025-06-17T02:12:36.640020Z ERROR text_embeddings_backend: backends/src/lib.rs:388: Could not start Candle backend: Could not start backend: classifier model type is not supported for Qwen3
rerank-qwen3 | Error: Could not create backend
rerank-qwen3 |
rerank-qwen3 | Caused by:
rerank-qwen3 |     Could not start backend: Could not start a suitable backend
Requested Features:
Add support for Qwen3-Reranker model architecture
Implement proper handling of the sequence classification variant
Include the model in the supported model types for reranking tasks
Use Case:
This would enable users to deploy Qwen3-Reranker as part of their embedding and retrieval pipelines using the optimized inference server.
Would you be able to provide guidance on what would be needed to implement this support? I'm happy to provide additional details or testing if needed.
Motivation
Qwen3-Reranker is a high-performance reranking model developed by Alibaba Cloud, offering a strong balance between efficiency and accuracy for retrieval-augmented generation (RAG) and semantic search tasks. Currently, text-embeddings-inference (TEI) does not support Qwen3ForSequenceClassification, making it difficult to deploy Qwen3-Reranker in optimized inference pipelines.
Supporting Qwen3-Reranker in TEI would:
Enable seamless integration with existing RAG and search systems.
Provide optimized inference (e.g., FlashAttention, dynamic batching) compared to manual deployment.
Expand TEI's coverage of popular open-weight models, aligning with the growing adoption of the Qwen series (Qwen2, Qwen1.5, etc.).
Given the increasing use of Qwen models in industry and research, adding native support for Qwen3-Reranker would significantly improve user experience and broaden TEI's applicability.
Your contribution
I'm opening this issue to request support for Qwen3-Reranker. While I don't have a concrete implementation yet, I'm happy to:
- Provide testing on different hardware environments
- Share benchmark results
- Collaborate on validating any potential solutions
Looking forward to the support of Qwen3-Reranker series models!
The Qwen3 embedding and rerank models, which are based on the Qwen3 chat model, perform quite well in several fields; please consider this request.
Any update on this?
What is required in order to properly run qwen3 rerankers with the latest TEI version?
Is using --pooling last-token enough?
Can anyone guide us?
Thanks in advance!
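For concreteness, the invocation I have in mind would be something like this (the image tag and local path are just examples):

docker run --gpus all -p 8080:80 -v /root/Qwen3-Reranker-0.6B:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.8.0 \
  --model-id /data --pooling last-token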
Thanks for the reply. I used the following command:
docker run --gpus all -p 8080:80 -v /root/Qwen3-Reranker-0.6B:/data ghcr.io/huggingface/text-embeddings-inference:1.8.0 --model-id /data
2025-08-08T15:26:34.314552Z INFO text_embeddings_router: router/src/main.rs:202: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, dense_path: Some("2_Dense"), hf_api_token: None, hf_token: None, hostname: "8f210320888f", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
Error: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model.
Caused by: No such file or directory (os error 2)
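(For context, the 1_Pooling/config.json it asks for is the standard sentence-transformers pooling config; a minimal last-token variant would look like the snippet below, where the 1024 dimension is an assumption for the 0.6B model. As shown next, though, a pooling config alone does not make TEI treat the checkpoint as a reranker.)

{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_lasttoken": true
}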
Then I checked the content of the config.json file in the Qwen3-Reranker repo (https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/blob/main/config.json); it looks like the config file of a normal language model. So I changed it like this:
{
  "architectures": [
    "Qwen3ForSequenceClassification"
  ],
  "id2label": {
    "0": "LABEL_0"
  },
  "label2id": {
    "LABEL_0": 0
  },
  ... same content ...
}
(py13) root@DESKTOP-FT1RFNR:~# docker run --gpus all -p 8080:80 -v /root/Qwen3-Reranker-0.6B:/data ghcr.io/huggingface/text-embeddings-inference:1.8.0 --model-id /data
2025-08-08T15:38:16.766271Z INFO text_embeddings_router: router/src/main.rs:202: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, dense_path: Some("2_Dense"), hf_api_token: None, hf_token: None, hostname: "c081edb149d5", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-08-08T15:38:17.008357Z WARN text_embeddings_router: router/src/lib.rs:193: Could not find a Sentence Transformers config
2025-08-08T15:38:17.008389Z INFO text_embeddings_router: router/src/lib.rs:197: Maximum number of tokens per request: 40960
2025-08-08T15:38:17.008535Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 20 tokenization workers
2025-08-08T15:38:17.568725Z INFO text_embeddings_router: router/src/lib.rs:239: Starting model backend
2025-08-08T15:38:17.869218Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:466: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))
2025-08-08T15:38:17.869949Z ERROR text_embeddings_backend: backends/src/lib.rs:411: Could not start Candle backend: Could not start backend: classifier model type is not supported for Qwen3
Error: Could not create backend
Caused by: Could not start backend: Could not start a suitable backend
I've just added support for this in a PR, please check it out: https://github.com/huggingface/text-embeddings-inference/pull/695
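Once a server including that PR is running, reranking goes through TEI's standard /rerank endpoint; for example:

curl 127.0.0.1:8080/rerank \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is a subfield of machine learning.", "cheese"]}'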
convert_to_st.py
from sentence_transformers import CrossEncoder
# HF Qwen3-Reranker model
src_model = "/PATH/Qwen/Qwen3-Reranker-4B-HF"
# sentence-transformers
dst_model = "/PATH/Qwen/Qwen3-Reranker-4B"
# Loading HuggingFace model
print(f"Loading HF model from {src_model} ...")
model = CrossEncoder(src_model)
# Saving as sentence-transformers
print(f"Saving as sentence-transformers CrossEncoder to {dst_model} ...")
model.save(dst_model)
print("✅ Done! You can now mount this folder to TEI and call /rerank")
Does this solution work for anyone?