Qwen3-Embedding models do not work
Model description
The Qwen3 models easily outperform nearly every other open-source embedding model; however, they do not work in Infinity due to an outdated transformers version. My docker compose file:
version: '3.8'

services:
  infinity:
    image: michaelf34/infinity:latest
    environment:
      DO_NOT_TRACK: 1  # Disable telemetry
      INFINITY_BETTERTRANSFORMER: True
      HF_HOME: /app/data  # Use /app/data
      INFINITY_MODEL_ID: Qwen/Qwen3-Embedding-4B;Qwen/Qwen3-Reranker-4B  # Model(s), semicolon separated
      INFINITY_PORT: 7997  # Port
      INFINITY_API_KEY: foo  # Optional API key
      INFINITY_DEVICE: cuda
      INFINITY_VECTOR_DISK_CACHE: True
    volumes:
      - ./infinity:/app/data:rw  # Persist /app/data to ./infinity in the current directory
      - ./models:/data/models:ro  # Mount local models to /data/models in read-only mode
      - ./infinity:/data/hf_cache:rw  # Mount cache data to /data/hf_cache in read-write mode
    ports:
      - "7997:7997"  # Flexible port mapping
    command: v2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  infinity:  # Named volume declaration
This results in the error:
infinity-1 | ValueError: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
To fix this, you should just be able to update transformers to >=4.51.0, as per the Qwen documentation.
Open source status & huggingface transformers.
- [x] The model implementation is available on transformers
- [x] The model weights are available on huggingface-hub
- [x] I verified that the model is currently not running in the latest version `pip install infinity_emb[all] --upgrade`
- [ ] I made the authors of the model aware that I want to use it with infinity_emb & check if they are aware of the issue.
Hi,
thank you for reporting this. Related issue: https://github.com/michaelfeil/infinity/issues/598
Until a new docker image is released, you should be able to build a custom docker image by following the suggestions in the linked issue.
Is there any way to do this without needing to completely rebuild all of the docker containers? I'm aware I could technically do something like:
sudo docker compose run -it --entrypoint sh infinity
pip install --upgrade transformers
pip install --upgrade accelerate
infinity_emb v2
But that seems really arduous and kind of janky.
Okay, I found a rough workaround: if you simply add the following line to your docker-compose, it will update those packages on every start before launching Infinity:
entrypoint: ["sh", "-cv", "pip install --upgrade transformers accelerate && infinity_emb v2"]
You don't need to run the commands on every start. Just build a custom image where those commands are run when building the image.
Create a "Dockerfile" with the following:
FROM michaelf34/infinity:latest
RUN pip install --upgrade transformers accelerate
Now change your docker-compose.yaml file from this:
infinity:
  image: michaelf34/infinity:latest
...to this:
infinity:
  build: .  # Build from the Dockerfile in this directory
  image: infinity-custom:latest  # What to call this new image
Run `docker compose up -d` and you're done. You can revert to the original image once the upstream one has been patched.
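If you want to confirm the rebuilt image actually picked up the newer packages, a quick check (assuming the `infinity` service name from the compose file above) is:
docker compose exec infinity pip show transformers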
I'm aware; however, constraints in my current environment mean that injecting the commands is the simplest way to do things. Your approach is better when a user is able to build another image, though.
I updated transformers but it still does not support the Qwen3-Reranker-8B model. Here is my Dockerfile:
FROM michaelf34/infinity:0.0.76
RUN pip install --upgrade transformers accelerate
docker-compose.yml:
services:
  qwen3-reranker-8b:
    container_name: qwen3-reranker-8b
    #image: vllm/vllm-openai:v0.8.5
    image: michaelf34/infinity:0.0.76-update
    restart: always
    #command: |
    #  --model "/models/Qwen3-Reranker-8B" --served-model-name Qwen/Qwen3-Reranker-8B --port 7997 --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' --task score
    command: |
      v2 --model-id "/models/Qwen3-Reranker-8B" --served-model-name Qwen/Qwen3-Reranker-8B --revision "main" --dtype bfloat16 --batch-size 32 --device cuda --engine torch --port 7997
    tty: true
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
      - "/home/qingfu.zeng/Qwen:/models"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['4']
              capabilities: [gpu]
          memory: 10G
        limits:
          memory: 20G
    ports:
      - "7997:7997"
After deploying, the rerank API call returns an error. API call command:
curl -X 'POST' 'http://172.16.30.224:7997/rerank' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
  "query": "string",
  "documents": [
    "string"
  ],
  "return_documents": false,
  "raw_scores": false,
  "model": "Qwen/Qwen3-Reranker-8B",
  "top_n": 1
}'
errors:
{
  "error": {
    "message": "ModelNotDeployedError: model=`/models/Qwen3-Reranker-8B` does not support `rerank`. Reason: the loaded moded cannot fullyfill `rerank`. Options are {'embed'}.",
    "type": null,
    "param": null,
    "code": 400
  }
}
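Per the error message, the model was loaded with only the `embed` capability, so the `/rerank` route rejects it. One quick way to see what the server actually exposes is the model listing route of Infinity's OpenAI-compatible API, for example:
curl -H 'accept: application/json' 'http://172.16.30.224:7997/models'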
I use this code to run Qwen3 embedding and reranking models:
"""
Standalone deployment for embeddings and reranking using Infinity.
This deployment includes:
- Embeddings: Qwen3-Embedding-0.6B
- Reranking: Qwen/Qwen3-Reranker-0.6B
"""
import subprocess
import sys
import os
import logging
from pathlib import Path
# Model configuration
EMBEDDER_MODEL = "Qwen/Qwen3-Embedding-0.6B"
RERANKER_MODEL = "Qwen/Qwen3-Reranker-0.6B"
# Configuration
PORT = 7997
BATCH_SIZE = 6
HOST = "0.0.0.0"
# Environment setup
def setup_environment():
    """Set up environment variables for optimal performance."""
    # Use current directory for cache instead of /app
    current_dir = Path.cwd()
    cache_base = current_dir / ".cache"
    env_vars = {
        "INFINITY_QUEUE_SIZE": "2048",
        "INFINITY_HOME": str(cache_base / "infinity"),
        "HF_HOME": str(cache_base / "huggingface"),
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    }
    for key, value in env_vars.items():
        os.environ[key] = value
    # Create cache directories
    cache_dirs = [
        cache_base / "infinity",
        cache_base / "huggingface"
    ]
    for cache_dir in cache_dirs:
        cache_dir.mkdir(parents=True, exist_ok=True)
        print(f"Created cache directory: {cache_dir}")


def check_dependencies():
    """Check if required dependencies are installed."""
    # Map of import names to package names
    required_imports = {
        "torch": "torch",
        "transformers": "transformers",
        "huggingface_hub": "huggingface_hub",
        "sentencepiece": "sentencepiece",
        "google.protobuf": "protobuf",  # Fixed: protobuf imports as google.protobuf
        "torchao": "torchao"
    }
    missing_packages = []
    for import_name, package_name in required_imports.items():
        try:
            __import__(import_name)
        except ImportError:
            missing_packages.append(package_name)
    if missing_packages:
        print(f"Missing required packages: {missing_packages}")
        print("Please install them using:")
        print("pip install torch>=2.7.0 transformers>=4.51.0 huggingface_hub[hf_transfer]==0.33.0 sentencepiece protobuf torchao --extra-index-url https://download.pytorch.org/whl/cu128")
        print("pip install 'infinity-emb[torch,server] @git+https://github.com/aryasaatvik/infinity.git@dev#subdirectory=libs/infinity_emb'")
        sys.exit(1)


def preload_models():
    """Preload models to cache them locally."""
    print("Preloading models...")
    cmd = f"infinity_emb v2 --model-id {EMBEDDER_MODEL} --model-id {RERANKER_MODEL} --preload-only"
    try:
        result = subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True)
        print("Models preloaded successfully")
        if result.stdout:
            print(result.stdout)
    except subprocess.CalledProcessError as e:
        print(f"Failed to preload models: {e}")
        print(f"Error output: {e.stderr}")
        raise


def serve_infinity():
    """
    Run Infinity server with both embedding and reranking models.
    Infinity can serve multiple models simultaneously, which is more
    efficient than running separate instances.
    """
    cmd = [
        "infinity_emb",
        "v2",
        # Core settings
        "--host", HOST,
        "--port", str(PORT),
        "--model-id", EMBEDDER_MODEL,
        "--served-model-name", "Qwen/Qwen3-Embedding-0.6B",
        "--model-id", RERANKER_MODEL,
        "--served-model-name", "Qwen/Qwen3-Reranker-0.6B",
        # Performance settings
        "--batch-size", str(BATCH_SIZE),
        "--device", "cuda",
        "--engine", "torch",
        "--pooling-method", "mean",
        "--trust-remote-code",
        "--no-bettertransformer",
        "--log-level", "debug",
    ]
    print(f"Starting Infinity server with command: {' '.join(cmd)}")
    print(f"Server will be available at http://{HOST}:{PORT}")
    try:
        # Use subprocess.run to keep the process in foreground
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Server failed with exit code: {e.returncode}")
        sys.exit(1)
    except KeyboardInterrupt:
        print("\nShutting down server...")
        sys.exit(0)


def main():
    """Main function to start the Infinity server."""
    print("Setting up environment...")
    setup_environment()
    print("Checking dependencies...")
    check_dependencies()
    print("Preloading models...")
    preload_models()
    print("Starting Infinity server...")
    serve_infinity()


if __name__ == "__main__":
    main()
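Once the server is up, both models should be reachable on the same port. A quick smoke test (a sketch, assuming the default host/port and the served model names from the script above):
curl -X 'POST' 'http://localhost:7997/embeddings' -H 'Content-Type: application/json' -d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": ["hello world"]}'
curl -X 'POST' 'http://localhost:7997/rerank' -H 'Content-Type: application/json' -d '{"model": "Qwen/Qwen3-Reranker-0.6B", "query": "what is infinity?", "documents": ["Infinity serves embedding and reranking models.", "Unrelated text."], "top_n": 1}'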
Related: microsoft/onnxruntime#25083 for optimization. The BERT optimizer should work for the Embedding and Reranking models; the generative Qwen3 models would probably benefit from the Phi optimizer, as they also use GQA and RoPE. Model support is also missing in huggingface/optimum-onnx.
This will also depend on #619 for supporting newer transformers packages.
@matfax I added a note on the issue you linked; the qwen3-moe is supported in optimum-onnx, I just think the way it is constrained to be exported (looping over all experts) is suboptimal for inference.
Optimum seemed to have issues with Embedding in particular. I couldn't get this running on ONNX/TRT in a non-broken/accurate state. As far as I recall, they use an instruction that breaks the optimization, that's technically not in the opset version. The uploaded ONNX versions on HF either don't seem to work on onnxruntime or they work, but produce inaccurate vectors. There's also a number of other issues with the input mapping and weights that I don't recall. So it's not the export itself, but any attempt to optimize and quantize with optimum or TRT. And without any of these optimizations, GGUF seems to be the better pick. So, while it seems to be possible to run the Embedding and Reranking models, it's just not with any of the benefits one would expect in memory/inference speed.
https://forums.developer.nvidia.com/t/tensorrt-produce-all-zero-output-for-qwen3-embedding-0-6b/337047/4
When I try to upgrade transformers, it says it is then incompatible with colpali-engine. I cannot get it running with Qwen3-Embedding-0.6B :( Any ideas?
In the Dockerfile I did this:
FROM michaelf34/infinity:0.0.77
RUN pip install --upgrade pip && \
    pip install --upgrade \
        transformers accelerate \
        colpali-engine \
        "numpy<2"
COPY ./entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
and this is the entrypoint.sh:
#!/bin/sh
set -e
# THIS PATH IS CONFIRMED by our debug session for the :0.0.77 image.
PYTHON_EXEC="/usr/bin/python3.10"
# --- The rest of the script is correct ---
MODULE_PATH="infinity_emb.cli"
SERVER_COMMAND="$@"
MODEL_ID=$(echo "$@" | grep -o -P '(?<=--model-id )[^ ]+')
MODEL_DIR_NAME="models--$(echo "$MODEL_ID" | sed 's/\//--/g')"
CACHE_HOME=${HF_HOME:-/app/.cache}
MODEL_PATH="$CACHE_HOME/hub/$MODEL_DIR_NAME"
if [ -d "$MODEL_PATH" ]; then
    echo "--- [INFO] Model '$MODEL_ID' already found in cache at '$MODEL_PATH'. Skipping download."
else
    echo "--- [INFO] Model '$MODEL_ID' not found in cache. Starting one-time download..."
    $PYTHON_EXEC -m $MODULE_PATH $SERVER_COMMAND --download-only
    echo "--- [INFO] Model download complete."
fi
echo "--- [INFO] Starting Infinity server..."
exec $PYTHON_EXEC -m $MODULE_PATH $SERVER_COMMAND
I also wanted to have a model downloader and separate the model from the image itself to make it smaller. But when I run it I get:
infinity-embedding-qwen3-0_6_B-local | --- [INFO] Model 'Qwen/Qwen3-Embedding-0.6B' already found in cache at '/app/.cache/huggingface/hub/models--Qwen--Qwen3-Embedding-0.6B'. Skipping download.
infinity-embedding-qwen3-0_6_B-local | --- [INFO] Starting Infinity server...
infinity-embedding-qwen3-0_6_B-local | Traceback (most recent call last):
infinity-embedding-qwen3-0_6_B-local |   File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
infinity-embedding-qwen3-0_6_B-local |     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
infinity-embedding-qwen3-0_6_B-local |   File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
infinity-embedding-qwen3-0_6_B-local |     __import__(pkg_name)
infinity-embedding-qwen3-0_6_B-local |   File "/app/infinity_emb/__init__.py", line 6, in <module>
infinity-embedding-qwen3-0_6_B-local |     from infinity_emb.args import EngineArgs # noqa: E402
infinity-embedding-qwen3-0_6_B-local |   File "/app/infinity_emb/args.py", line 12, in <module>
infinity-embedding-qwen3-0_6_B-local |     from infinity_emb.env import MANAGER
infinity-embedding-qwen3-0_6_B-local |   File "/app/infinity_emb/env.py", line 12, in <module>
infinity-embedding-qwen3-0_6_B-local |     from infinity_emb.primitives import (
infinity-embedding-qwen3-0_6_B-local |   File "/app/infinity_emb/primitives.py", line 32, in <module>
infinity-embedding-qwen3-0_6_B-local |     import numpy as np
infinity-embedding-qwen3-0_6_B-local | ModuleNotFoundError: No module named 'numpy'
infinity-embedding-qwen3-0_6_B-local exited with code 1
Infinity creates its own virtual environment inside the container, so try to run it from /app/.venv/bin/python3.10:
root@448acd6fd689:/app# /usr/bin/python3.10 -m infinity_emb
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.10/runpy.py", line 146, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "/app/infinity_emb/__init__.py", line 6, in <module>
from infinity_emb.args import EngineArgs # noqa: E402
File "/app/infinity_emb/args.py", line 12, in <module>
from infinity_emb.env import MANAGER
File "/app/infinity_emb/env.py", line 12, in <module>
from infinity_emb.primitives import (
File "/app/infinity_emb/primitives.py", line 32, in <module>
import numpy as np
ModuleNotFoundError: No module named 'numpy'
root@448acd6fd689:/app# .venv/bin/python3 -m infinity_emb.cli --model_id Qwen/Qwen3-Embedding-0.6B
Working:
root@448acd6fd689:/app# .venv/bin/python3 -m infinity_emb.cli v2 --model-id Qwen/Qwen3-Embedding-0.6B
INFO: Started server process [642]
INFO: Waiting for application startup.
INFO 2025-11-14 00:56:57,796 infinity_emb INFO: Creating 1 infinity_server.py:84
engines: ['Qwen/Qwen3-Embedding-0.6B']
INFO 2025-11-14 00:56:57,798 infinity_emb INFO: Anonymized telemetry.py:30
telemetry can be disabled via environment variable
`DO_NOT_TRACK=1`.
INFO 2025-11-14 00:56:57,802 infinity_emb INFO: select_model.py:66
model=`Qwen/Qwen3-Embedding-0.6B` selected, using
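With that, the entrypoint.sh above should presumably only need its interpreter path pointed at the bundled venv (assuming the path from the session above; verify the exact path inside your image), e.g.:
PYTHON_EXEC="/app/.venv/bin/python3"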