Support google/embeddinggemma-300m and Qwen/Qwen3-Reranker-0.6B
Model description
google/embeddinggemma-300m
embedding-server | INFO: Waiting for application startup.
embedding-server | INFO 2025-09-18 15:07:20,029 infinity_emb INFO: infinity_server.py:84
embedding-server | Creating 2 engines:
embedding-server | ['google/embeddinggemma-300m',
embedding-server | 'Qwen/Qwen3-Reranker-0.6B']
embedding-server | INFO 2025-09-18 15:07:20,031 infinity_emb INFO: telemetry.py:34
embedding-server | DO_NOT_TRACK=1 registered. Anonymized usage statistics
embedding-server | are disabled.
embedding-server | INFO 2025-09-18 15:07:20,034 infinity_emb INFO: select_model.py:66
embedding-server | model=`google/embeddinggemma-300m` selected, using
embedding-server | engine=`torch` and device=`cuda`
embedding-server | ERROR: Traceback (most recent call last):
embedding-server | File "/app/.venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1082, in from_pretrained
embedding-server | config_class = CONFIG_MAPPING[config_dict["model_type"]]
embedding-server | File "/app/.venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 784, in __getitem__
embedding-server | raise KeyError(key)
embedding-server | KeyError: 'gemma3_text'
embedding-server |
embedding-server | During handling of the above exception, another exception occurred:
embedding-server |
embedding-server | Traceback (most recent call last):
embedding-server | File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 693, in lifespan
embedding-server | async with self.lifespan_context(app) as maybe_state:
embedding-server | File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
embedding-server | return await anext(self.gen)
embedding-server | File "/app/infinity_emb/infinity_server.py", line 88, in lifespan
embedding-server | app.engine_array = AsyncEngineArray.from_args(engine_args_list) # type: ignore
embedding-server | File "/app/infinity_emb/engine.py", line 306, in from_args
embedding-server | return cls(engines=tuple(engines))
embedding-server | File "/app/infinity_emb/engine.py", line 71, in from_args
embedding-server | engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
embedding-server | File "/app/infinity_emb/engine.py", line 56, in __init__
embedding-server | self._model_replicas, self._min_inference_t, self._max_inference_t = select_model(
embedding-server | File "/app/infinity_emb/inference/select_model.py", line 83, in select_model
embedding-server | loaded_engine = unloaded_engine.value(engine_args=engine_args_copy)
embedding-server | File "/app/infinity_emb/transformer/embedder/sentence_transformer.py", line 62, in __init__
embedding-server | attempt_bt = check_if_bettertransformer_possible(engine_args)
embedding-server | File "/app/infinity_emb/transformer/acceleration.py", line 40, in check_if_bettertransformer_possible
embedding-server | config = AutoConfig.from_pretrained(
embedding-server | File "/app/.venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1084, in from_pretrained
embedding-server | raise ValueError(
embedding-server | ValueError: The checkpoint you are trying to load has model type `gemma3_text` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
embedding-server |
embedding-server | You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
embedding-server |
embedding-server | ERROR: Application startup failed. Exiting.
WARN[0010] optional dependency "embedding-server" failed to start: container embedding-server exited (3)
embedding-server exited with code 3
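For reference, the root cause of the traceback above can be sketched without downloading anything: `AutoConfig.from_pretrained` looks up `config_dict["model_type"]` in `CONFIG_MAPPING`, and a transformers release that predates Gemma 3 has no `gemma3_text` entry, so the `KeyError` surfaces as the `ValueError` shown. A toy stand-in (the mapping contents here are illustrative, not the real registry):

```python
# Simplified stand-in for transformers' CONFIG_MAPPING (illustrative only;
# the real registry lives in transformers/models/auto/configuration_auto.py).
# An older release simply has no entry for "gemma3_text".
KNOWN_MODEL_TYPES = {"bert": "BertConfig", "gemma": "GemmaConfig"}

def resolve_config(model_type: str) -> str:
    """Mimic AutoConfig.from_pretrained's model_type lookup and error handling."""
    try:
        return KNOWN_MODEL_TYPES[model_type]
    except KeyError:
        raise ValueError(
            f"The checkpoint you are trying to load has model type `{model_type}` "
            "but Transformers does not recognize this architecture."
        ) from None

print(resolve_config("bert"))    # → BertConfig
# resolve_config("gemma3_text") # raises ValueError, as in the log above
```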
Qwen/Qwen3-Reranker-0.6B
embedding-server | INFO: Started server process [1]
embedding-server | INFO: Waiting for application startup.
embedding-server | INFO 2025-09-18 15:11:31,772 infinity_emb INFO: infinity_server.py:84
embedding-server | Creating 1 engines: ['Qwen/Qwen3-Reranker-0.6B']
embedding-server | INFO 2025-09-18 15:11:31,774 infinity_emb INFO: telemetry.py:34
embedding-server | DO_NOT_TRACK=1 registered. Anonymized usage statistics
embedding-server | are disabled.
embedding-server | INFO 2025-09-18 15:11:31,777 infinity_emb INFO: select_model.py:66
embedding-server | model=`Qwen/Qwen3-Reranker-0.6B` selected, using
embedding-server | engine=`torch` and device=`cuda`
embedding-server | ERROR: Traceback (most recent call last):
embedding-server | File "/app/.venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1082, in from_pretrained
embedding-server | config_class = CONFIG_MAPPING[config_dict["model_type"]]
embedding-server | File "/app/.venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 784, in __getitem__
embedding-server | raise KeyError(key)
embedding-server | KeyError: 'qwen3'
embedding-server |
embedding-server | During handling of the above exception, another exception occurred:
embedding-server |
embedding-server | Traceback (most recent call last):
embedding-server | File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 693, in lifespan
embedding-server | async with self.lifespan_context(app) as maybe_state:
embedding-server | File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
embedding-server | return await anext(self.gen)
embedding-server | File "/app/infinity_emb/infinity_server.py", line 88, in lifespan
embedding-server | app.engine_array = AsyncEngineArray.from_args(engine_args_list) # type: ignore
embedding-server | File "/app/infinity_emb/engine.py", line 306, in from_args
embedding-server | return cls(engines=tuple(engines))
embedding-server | File "/app/infinity_emb/engine.py", line 71, in from_args
embedding-server | engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
embedding-server | File "/app/infinity_emb/engine.py", line 56, in __init__
embedding-server | self._model_replicas, self._min_inference_t, self._max_inference_t = select_model(
embedding-server | File "/app/infinity_emb/inference/select_model.py", line 83, in select_model
embedding-server | loaded_engine = unloaded_engine.value(engine_args=engine_args_copy)
embedding-server | File "/app/infinity_emb/transformer/embedder/sentence_transformer.py", line 62, in __init__
embedding-server | attempt_bt = check_if_bettertransformer_possible(engine_args)
embedding-server | File "/app/infinity_emb/transformer/acceleration.py", line 40, in check_if_bettertransformer_possible
embedding-server | config = AutoConfig.from_pretrained(
embedding-server | File "/app/.venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1084, in from_pretrained
embedding-server | raise ValueError(
embedding-server | ValueError: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
embedding-server |
embedding-server | You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
embedding-server |
embedding-server | ERROR: Application startup failed. Exiting.
WARN[0011] optional dependency "embedding-server" failed to start: container embedding-server exited (3)
embedding-server exited with code 3
Open source status & Hugging Face Transformers
- [x] The model implementation is available on transformers
- [x] The model weights are available on huggingface-hub
- [x] I verified that the model is currently not running in the latest version (`pip install infinity_emb[all] --upgrade`)
- [x] I made the authors of the model aware that I want to use it with infinity_emb & checked whether they are aware of the issue.
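Before retrying, it can help to confirm which package versions are actually installed. A stdlib-only check (the distribution names below are the PyPI names, which may differ from your install; the minimum versions implied in this thread are assumptions: gemma3 support landed around transformers 4.50 and qwen3 around 4.51):

```python
# Print installed versions of the packages involved (stdlib only).
from importlib.metadata import version, PackageNotFoundError

for pkg in ("transformers", "sentence-transformers", "accelerate", "infinity-emb"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```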
I updated transformers, accelerate, and sentence_transformers and was able to run embeddinggemma-300m, but the embedding returned seems to be wrong. Has anyone got this working? E.g., below is the embedding for "apple":
{"object":"list","data":[{"object":"embedding","embedding":[0.02299346774816513,0.04923601448535919,0.008185175247490406,0.04723658040165901,-0.0192445330321788
which, using sentence-transformers in Python, gives:
[-0.18476315 0.00167681 0.03773482 ... -0.07996223 -0.02348067 0.00976739]
I cross-checked with TEI (Hugging Face's text-embeddings-inference) and got correct embeddings there.
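The cross-check itself can be sketched as below; the endpoint URL, default port, and response shape follow the JSON shown above, but treat them as assumptions for your own deployment:

```python
# Hedged sketch of the cross-check: fetch an embedding from the server's
# OpenAI-style endpoint and compare it with a local encode via cosine
# similarity. URL, port (7997), and route are assumptions; only stdlib used.
import json
import math
import urllib.request

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def remote_embedding(text, url="http://localhost:7997/embeddings",
                     model="google/embeddinggemma-300m"):
    """POST to the embeddings endpoint and return the first embedding vector."""
    payload = json.dumps({"model": model, "input": [text]}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]

# Matching backends should score ~1.0; the mismatch above scores far lower.
print(round(cosine([1.0, 2.0], [1.0, 2.0]), 6))  # → 1.0
```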
OK, after further testing I confirmed that the discrepancy is because Infinity implements only the transformer backbone together with the pooling. embeddinggemma-300m has two additional Dense modules after this, applied as a post-processing step. Without them the embeddings are different and the quality/performance is adversely impacted.
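For intuition, here is a minimal pure-Python sketch of that gap with toy weights; the identity activation and the Transformer -> Pooling -> Dense -> Dense -> Normalize pipeline are assumptions based on the model's sentence-transformers config, not Infinity's actual code:

```python
import math

def dense(vec, weight, bias):
    """One sentence-transformers-style Dense module: vec @ W^T + b (identity activation)."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def normalize(vec):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

pooled = [1.0, 2.0]  # stand-in for the pooled backbone output
w1, b1 = [[0.5, 0.0], [0.0, 0.5]], [0.1, 0.1]   # toy weights for Dense #1
w2, b2 = [[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]  # toy weights for Dense #2

backbone_only = normalize(pooled)                                # pooling only
full_pipeline = normalize(dense(dense(pooled, w1, b1), w2, b2))  # with both Dense modules

print(backbone_only != full_pipeline)  # → True: skipping Dense changes the vector
```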