Qwen3-Embedding models do not work
Model description
The Qwen3 models easily outperform nearly every other open-source embedding model; however, they do not work in Infinity due to an outdated transformers version. My docker compose file:
version: '3.8'

services:
  infinity:
    image: michaelf34/infinity:latest
    environment:
      DO_NOT_TRACK: 1  # Disable telemetry
      INFINITY_BETTERTRANSFORMER: True
      HF_HOME: /app/data  # Use /app/data
      INFINITY_MODEL_ID: Qwen/Qwen3-Embedding-4B;Qwen/Qwen3-Reranker-4B  # Model(s), semicolon separated
      INFINITY_PORT: 7997  # Port
      INFINITY_API_KEY: foo  # Optional API key
      INFINITY_DEVICE: cuda
      INFINITY_VECTOR_DISK_CACHE: True
    volumes:
      - ./infinity:/app/data:rw  # Persist /app/data to ./infinity in the current directory
      - ./models:/data/models:ro  # Mount local models to /data/models in read-only mode
      - ./infinity:/data/hf_cache:rw  # Mount cache data to /data/hf_cache in read-write mode
    ports:
      - "7997:7997"  # Flexible port mapping
    command: v2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  infinity:  # Named volume declaration
This results in the error:
infinity-1 | ValueError: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
To fix this, you should just be able to update transformers to >=4.51.0, as per the Qwen documentation.
Open source status & huggingface transformers.
- [x] The model implementation is available on transformers
- [x] The model weights are available on huggingface-hub
- [x] I verified that the model is currently not running in the latest version `pip install infinity_emb[all] --upgrade`
- [ ] I made the authors of the model aware that I want to use it with infinity_emb & check if they are aware of the issue.
Hi,
thank you for reporting this. Related issue: https://github.com/michaelfeil/infinity/issues/598
Until a new docker image is released, you should be able to build a custom docker image by following the suggestions in the linked issue.
Is there any way to do this without needing to completely rebuild all of the docker containers? I'm aware I could technically do something like:
sudo docker compose run -it --entrypoint sh infinity
pip install --upgrade transformers
pip install --upgrade accelerate
infinity_emb v2
But that seems really arduous and kind of janky.
Okay, I found a rough workaround: if you simply add the following line to your docker-compose, it will update those packages on every start before launching Infinity:
entrypoint: ["sh", "-cv", "pip install --upgrade transformers accelerate && infinity_emb v2"]
You don't need to run the commands on every start. Just build a custom image where those commands are run when building the image.
Create a "Dockerfile" with the following:
FROM michaelf34/infinity:latest
RUN pip install --upgrade transformers accelerate
Now change your docker-compose.yaml file from this:
infinity:
  image: michaelf34/infinity:latest
...to this:
infinity:
  build: .  # Build from the Dockerfile in this directory
  image: infinity-custom:latest  # What to call this new image
Run `docker compose up -d` and you're done. You can revert to the original image once the upstream one has been patched.
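If you want to confirm the rebuilt image actually picked up the newer packages, a quick check (assuming the `infinity` service name from the compose file above) is:
docker compose exec infinity pip show transformers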
I'm aware; however, constraints in my current environment mean that injecting the commands is the simplest way to do things. Your approach is better when a user is able to build another image, though.
I updated transformers but it still does not support the Qwen3-Reranker-8B model. Here is my Dockerfile:
FROM michaelf34/infinity:0.0.76
RUN pip install --upgrade transformers accelerate
docker-compose.yml:
services:
  qwen3-reranker-8b:
    container_name: qwen3-reranker-8b
    #image: vllm/vllm-openai:v0.8.5
    image: michaelf34/infinity:0.0.76-update
    restart: always
    #command: |
    #  --model "/models/Qwen3-Reranker-8B" --served-model-name Qwen/Qwen3-Reranker-8B --port 7997 --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' --task score
    command: |
      v2 --model-id "/models/Qwen3-Reranker-8B" --served-model-name Qwen/Qwen3-Reranker-8B --revision "main" --dtype bfloat16 --batch-size 32 --device cuda --engine torch --port 7997
    tty: true
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
      - "/home/qingfu.zeng/Qwen:/models"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['4']
              capabilities: [gpu]
          memory: 10G
        limits:
          memory: 20G
    ports:
      - "7997:7997"
After deploying, the rerank API call returns an error. API call command:
curl -X 'POST' 'http://172.16.30.224:7997/rerank' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
  "query": "string",
  "documents": [
    "string"
  ],
  "return_documents": false,
  "raw_scores": false,
  "model": "Qwen/Qwen3-Reranker-8B",
  "top_n": 1
}'
errors:
{
  "error": {
    "message": "ModelNotDeployedError: model=`/models/Qwen3-Reranker-8B` does not support `rerank`. Reason: the loaded moded cannot fullyfill `rerank`. Options are {'embed'}.",
    "type": null,
    "param": null,
    "code": 400
  }
}
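Per the error message, the model was loaded with only the `embed` capability, so the `/rerank` route rejects it. One quick way to see what the server actually exposes is the model listing route of Infinity's OpenAI-compatible API, for example:
curl -H 'accept: application/json' 'http://172.16.30.224:7997/models'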
I use this code to run Qwen3 embedding and reranking models:
"""
Standalone deployment for embeddings and reranking using Infinity.
This deployment includes:
- Embeddings: Qwen3-Embedding-0.6B
- Reranking: Qwen/Qwen3-Reranker-0.6B
"""
import subprocess
import sys
import os
import logging
from pathlib import Path
# Model configuration
EMBEDDER_MODEL = "Qwen/Qwen3-Embedding-0.6B"
RERANKER_MODEL = "Qwen/Qwen3-Reranker-0.6B"
# Configuration
PORT = 7997
BATCH_SIZE = 6
HOST = "0.0.0.0"
# Environment setup
def setup_environment():
    """Set up environment variables for optimal performance."""
    # Use current directory for cache instead of /app
    current_dir = Path.cwd()
    cache_base = current_dir / ".cache"
    env_vars = {
        "INFINITY_QUEUE_SIZE": "2048",
        "INFINITY_HOME": str(cache_base / "infinity"),
        "HF_HOME": str(cache_base / "huggingface"),
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    }
    for key, value in env_vars.items():
        os.environ[key] = value
    # Create cache directories
    cache_dirs = [
        cache_base / "infinity",
        cache_base / "huggingface"
    ]
    for cache_dir in cache_dirs:
        cache_dir.mkdir(parents=True, exist_ok=True)
        print(f"Created cache directory: {cache_dir}")


def check_dependencies():
    """Check if required dependencies are installed."""
    # Map of import names to package names
    required_imports = {
        "torch": "torch",
        "transformers": "transformers",
        "huggingface_hub": "huggingface_hub",
        "sentencepiece": "sentencepiece",
        "google.protobuf": "protobuf",  # Fixed: protobuf imports as google.protobuf
        "torchao": "torchao"
    }
    missing_packages = []
    for import_name, package_name in required_imports.items():
        try:
            __import__(import_name)
        except ImportError:
            missing_packages.append(package_name)
    if missing_packages:
        print(f"Missing required packages: {missing_packages}")
        print("Please install them using:")
        print("pip install torch>=2.7.0 transformers>=4.51.0 huggingface_hub[hf_transfer]==0.33.0 sentencepiece protobuf torchao --extra-index-url https://download.pytorch.org/whl/cu128")
        print("pip install 'infinity-emb[torch,server] @git+https://github.com/aryasaatvik/infinity.git@dev#subdirectory=libs/infinity_emb'")
        sys.exit(1)


def preload_models():
    """Preload models to cache them locally."""
    print("Preloading models...")
    cmd = f"infinity_emb v2 --model-id {EMBEDDER_MODEL} --model-id {RERANKER_MODEL} --preload-only"
    try:
        result = subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True)
        print("Models preloaded successfully")
        if result.stdout:
            print(result.stdout)
    except subprocess.CalledProcessError as e:
        print(f"Failed to preload models: {e}")
        print(f"Error output: {e.stderr}")
        raise


def serve_infinity():
    """
    Run Infinity server with both embedding and reranking models.
    Infinity can serve multiple models simultaneously, which is more
    efficient than running separate instances.
    """
    cmd = [
        "infinity_emb",
        "v2",
        # Core settings
        "--host", HOST,
        "--port", str(PORT),
        "--model-id", EMBEDDER_MODEL,
        "--served-model-name", "Qwen/Qwen3-Embedding-0.6B",
        "--model-id", RERANKER_MODEL,
        "--served-model-name", "Qwen/Qwen3-Reranker-0.6B",
        # Performance settings
        "--batch-size", str(BATCH_SIZE),
        "--device", "cuda",
        "--engine", "torch",
        "--pooling-method", "mean",
        "--trust-remote-code",
        "--no-bettertransformer",
        "--log-level", "debug",
    ]
    print(f"Starting Infinity server with command: {' '.join(cmd)}")
    print(f"Server will be available at http://{HOST}:{PORT}")
    try:
        # Use subprocess.run to keep the process in foreground
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Server failed with exit code: {e.returncode}")
        sys.exit(1)
    except KeyboardInterrupt:
        print("\nShutting down server...")
        sys.exit(0)


def main():
    """Main function to start the Infinity server."""
    print("Setting up environment...")
    setup_environment()
    print("Checking dependencies...")
    check_dependencies()
    print("Preloading models...")
    preload_models()
    print("Starting Infinity server...")
    serve_infinity()


if __name__ == "__main__":
    main()
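Once the server is up, both models should be reachable on the same port. A quick smoke test (a sketch, assuming the default host/port and the served model names from the script above):
curl -X 'POST' 'http://localhost:7997/embeddings' -H 'Content-Type: application/json' -d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": ["hello world"]}'
curl -X 'POST' 'http://localhost:7997/rerank' -H 'Content-Type: application/json' -d '{"model": "Qwen/Qwen3-Reranker-0.6B", "query": "what is infinity?", "documents": ["Infinity serves embedding and reranking models.", "Unrelated text."], "top_n": 1}'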
Related: microsoft/onnxruntime#25083 for optimization. The BERT optimizer should work for the Embedding and Reranking models; the generative Qwen3 models would probably benefit from the Phi optimizer, as they also use GQA and RoPE. Model support is also missing in huggingface/optimum-onnx.
This will also depend on #619 for supporting newer transformers packages.
@matfax I added a note on the issue you linked; the qwen3-moe is supported in optimum-onnx, I just think the way it is constrained to be exported (looping over all experts) is suboptimal for inference.
Optimum seemed to have issues with Embedding in particular. I couldn't get this running on ONNX/TRT in a non-broken/accurate state. As far as I recall, they use an instruction that breaks the optimization, that's technically not in the opset version. The uploaded ONNX versions on HF either don't seem to work on onnxruntime or they work, but produce inaccurate vectors. There's also a number of other issues with the input mapping and weights that I don't recall. So it's not the export itself, but any attempt to optimize and quantize with optimum or TRT. And without any of these optimizations, GGUF seems to be the better pick. So, while it seems to be possible to run the Embedding and Reranking models, it's just not with any of the benefits one would expect in memory/inference speed.
https://forums.developer.nvidia.com/t/tensorrt-produce-all-zero-output-for-qwen3-embedding-0-6b/337047/4
When I try to upgrade transformers, it says it is then incompatible with colpali-engine. I cannot get it running with Qwen3-Embedding-0.6B :( Any ideas?
In the Dockerfile I did this:
FROM michaelf34/infinity:0.0.77
RUN pip install --upgrade pip && \
    pip install --upgrade \
        transformers accelerate \
        colpali-engine \
        "numpy<2"
COPY ./entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
and this is the entrypoint.sh:
#!/bin/sh
set -e
# THIS PATH IS CONFIRMED by our debug session for the :0.0.77 image.
PYTHON_EXEC="/usr/bin/python3.10"
# --- The rest of the script is correct ---
MODULE_PATH="infinity_emb.cli"
SERVER_COMMAND="$@"
MODEL_ID=$(echo "$@" | grep -o -P '(?<=--model-id )[^ ]+')
MODEL_DIR_NAME="models--$(echo "$MODEL_ID" | sed 's/\//--/g')"
CACHE_HOME=${HF_HOME:-/app/.cache}
MODEL_PATH="$CACHE_HOME/hub/$MODEL_DIR_NAME"
if [ -d "$MODEL_PATH" ]; then
    echo "--- [INFO] Model '$MODEL_ID' already found in cache at '$MODEL_PATH'. Skipping download."
else
    echo "--- [INFO] Model '$MODEL_ID' not found in cache. Starting one-time download..."
    $PYTHON_EXEC -m $MODULE_PATH $SERVER_COMMAND --download-only
    echo "--- [INFO] Model download complete."
fi
echo "--- [INFO] Starting Infinity server..."
exec $PYTHON_EXEC -m $MODULE_PATH $SERVER_COMMAND
I also wanted to have a model downloader and separate the model from the image itself to make it smaller. But when I run it I get:
infinity-embedding-qwen3-0_6_B-local | --- [INFO] Model 'Qwen/Qwen3-Embedding-0.6B' already found in cache at '/app/.cache/huggingface/hub/models--Qwen--Qwen3-Embedding-0.6B'. Skipping download.
infinity-embedding-qwen3-0_6_B-local | --- [INFO] Starting Infinity server...
infinity-embedding-qwen3-0_6_B-local | Traceback (most recent call last):
infinity-embedding-qwen3-0_6_B-local |   File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
infinity-embedding-qwen3-0_6_B-local |     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
infinity-embedding-qwen3-0_6_B-local |   File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
infinity-embedding-qwen3-0_6_B-local |     __import__(pkg_name)
infinity-embedding-qwen3-0_6_B-local |   File "/app/infinity_emb/__init__.py", line 6, in <module>
infinity-embedding-qwen3-0_6_B-local |     from infinity_emb.args import EngineArgs # noqa: E402
infinity-embedding-qwen3-0_6_B-local |   File "/app/infinity_emb/args.py", line 12, in <module>
infinity-embedding-qwen3-0_6_B-local |     from infinity_emb.env import MANAGER
infinity-embedding-qwen3-0_6_B-local |   File "/app/infinity_emb/env.py", line 12, in <module>
infinity-embedding-qwen3-0_6_B-local |     from infinity_emb.primitives import (
infinity-embedding-qwen3-0_6_B-local |   File "/app/infinity_emb/primitives.py", line 32, in <module>
infinity-embedding-qwen3-0_6_B-local |     import numpy as np
infinity-embedding-qwen3-0_6_B-local | ModuleNotFoundError: No module named 'numpy'
infinity-embedding-qwen3-0_6_B-local exited with code 1
Infinity creates its own virtual environment inside the container, so try to run it from /app/.venv/bin/python3.10:
root@448acd6fd689:/app# /usr/bin/python3.10 -m infinity_emb
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.10/runpy.py", line 146, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "/app/infinity_emb/__init__.py", line 6, in <module>
from infinity_emb.args import EngineArgs # noqa: E402
File "/app/infinity_emb/args.py", line 12, in <module>
from infinity_emb.env import MANAGER
File "/app/infinity_emb/env.py", line 12, in <module>
from infinity_emb.primitives import (
File "/app/infinity_emb/primitives.py", line 32, in <module>
import numpy as np
ModuleNotFoundError: No module named 'numpy'
root@448acd6fd689:/app# .venv/bin/python3 -m infinity_emb.cli --model_id Qwen/Qwen3-Embedding-0.6B
Working:
root@448acd6fd689:/app# .venv/bin/python3 -m infinity_emb.cli v2 --model-id Qwen/Qwen3-Embedding-0.6B
INFO: Started server process [642]
INFO: Waiting for application startup.
INFO 2025-11-14 00:56:57,796 infinity_emb INFO: Creating 1 infinity_server.py:84
engines: ['Qwen/Qwen3-Embedding-0.6B']
INFO 2025-11-14 00:56:57,798 infinity_emb INFO: Anonymized telemetry.py:30
telemetry can be disabled via environment variable
`DO_NOT_TRACK=1`.
INFO 2025-11-14 00:56:57,802 infinity_emb INFO: select_model.py:66
model=`Qwen/Qwen3-Embedding-0.6B` selected, using
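With that, the entrypoint.sh above should presumably only need its interpreter path pointed at the bundled venv (assuming the path from the session above; verify the exact path inside your image), e.g.:
PYTHON_EXEC="/app/.venv/bin/python3"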