When will Gemma 3 be supported?
@bebilli
Hi bebilli,
We haven't finalized a plan to support Gemma 3 yet. If you are interested, you are welcome to contribute this model support to TensorRT-LLM, and we can provide the needed support and consulting.
June
I'm just an AI application developer. Does adding support for Gemma 3 require a strong, professional AI development background? If not, could you give me some guidance?
@bebilli
Hi,
I would recommend using the PyTorch workflow to add Gemma 3 model support, which has a less steep learning curve for AI application developers. You can follow this guide:
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/adding_new_model.md
and this example code (LLaMA) to add Gemma 3:
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_llama.py
If you hit any specific questions while adding Gemma 3, please let us know.
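For orientation, a rough skeleton mirroring the structure of modeling_llama.py could look like the sketch below. The class and base-class names are only what I would expect from that example, so please double-check them against the guide and the repo:

```python
# Rough skeleton mirroring the structure of modeling_llama.py; names and base classes
# should be double-checked against that file and the adding_new_model guide.
from transformers import Gemma3TextConfig

from tensorrt_llm._torch.models.modeling_utils import (DecoderModel,
                                                        DecoderModelForCausalLM,
                                                        register_auto_model)


class Gemma3Model(DecoderModel):
    """Embedding layer plus the stack of decoder layers, built from the
    TensorRT-LLM PyTorch modeling primitives."""
    ...


@register_auto_model("Gemma3ForCausalLM")  # Hugging Face architecture name to map
class Gemma3ForCausalLM(DecoderModelForCausalLM[Gemma3Model, Gemma3TextConfig]):
    """Adds the LM head and the checkpoint weight-loading logic."""

    def load_weights(self, weights: dict):
        # Map the Hugging Face checkpoint tensors onto the module parameters,
        # mirroring the load_weights implementation in modeling_llama.py.
        ...
```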
Thanks June
@juney-nvidia If I use the method you mentioned, is it necessary to convert to the native TensorRT format before inference? If conversion is not required, can the performance match that of the native TensorRT format?
With the PyTorch workflow, you don't need to convert the PyTorch model to the TensorRT format. Instead, you follow the step-by-step guide to add your new model, which includes writing your model definition with the TensorRT-LLM PyTorch modeling API and implementing the weight-loading logic.
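For illustration, once the model support is added, inference runs directly from the Hugging Face checkpoint. This is only a minimal sketch: the Gemma 3 model id is a placeholder that assumes the model support is in place, and depending on the release the PyTorch-backend LLM class may need to be imported from tensorrt_llm._torch instead:

```python
# Minimal sketch of inference through the PyTorch workflow: the Hugging Face
# checkpoint is loaded directly, with no TensorRT engine build or conversion step.
# The Gemma 3 model id is a placeholder and assumes its model support exists.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it")
sampling_params = SamplingParams(max_tokens=64)

for output in llm.generate(["What is TensorRT-LLM?"], sampling_params):
    print(output.outputs[0].text)
```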
As for performance: based on our internal benchmarks on key models such as LLaMA/Mistral/Mixtral, the PyTorch workflow is on par with (or even faster than) the TensorRT workflow. This is because the customized high-performance kernels are reused in both the TensorRT workflow (as TensorRT plugins) and the PyTorch workflow (as torch custom ops), and the high-performance C++ runtime building blocks (such as the batch scheduler, KV cache manager, and disaggregated-serving logic) are also shared by both workflows.
Also, thanks to the flexibility of PyTorch, more optimizations can be added quickly to further push the performance boundary.
The recently announced world-class DeepSeek R1 performance numbers on Blackwell were all measured with the PyTorch workflow, and we only support DeepSeek R1 in the PyTorch workflow for now.
Please let me know if there are any further questions.
Thanks June
Thank you for your guidance. I'll go and give it a try.
Thanks, looking forward to your contribution MR :)
June
@bebilli any updates? :)
@anonymousleek take a look at #3247
@InCogNiTo124 Does it support Gemma-3-27B?
@bebilli are you able to check if it supports Gemma-3-27B?
not yet: https://github.com/NVIDIA/TensorRT-LLM/pull/3247#issuecomment-2790519367
I am also interested in Gemma-3 12B Q4 variety. If I have time, I will give it a shot.
--Chris
I am also interested in faster inference for the gemma-3 27b 4-bit quantized model.
I would like to know if Gemma 3 is supported by the TensorRT-LLM engine in Python.
Thank you
Are there any updates on this topic?
Also interested in Gemma-3-27b
Hi, any update on Gemma 3 support?
Any update on Gemma 3 series and Gemma 3n support?
Hi @juney-nvidia, is serving Gemma 3 (google/gemma-3-4b-it) with Triton Inference Server using the TensorRT-LLM Python backend possible? I'm fine with text only for now...
Gemma 3 support started in TensorRT-LLM v0.19.0 with text-only models and has been continuously enhanced with PyTorch workflow support (v0.20.0) and multimodal capabilities (v0.21.0). The support status can also be checked in the support matrix: https://nvidia.github.io/TensorRT-LLM/latest/reference/support-matrix.html
Hi @karljang, what is the PyTorch workflow - do you mean the LLM API? How can I quantize the gemma3-4b (google/gemma-3-4b-it) model for text only and use the PyTorch workflow for inference?
Hello @karljang and @juney-nvidia,
I understand that the PyTorch workflow supports Gemma 3 VLM models (Gemma3ForConditionalGeneration). However, like the commenter above, I am also interested in running these models quantized.
In my case I'm trying to run google/gemma-3-27b-it in 4-bit precision so that it fits on my RTX 5090.
I tried to download the main branch and run the trt-llm quantize script using int4_awq because it is mentioned as supported for Gemma 3 in the modelopt LLM PTQ docs [1]. When I run it, I get an error mentioning "vocab_size" and Gemma3Config:
```
Traceback (most recent call last):
  File "/my-path/trt/quantize.py", line 160, in <module>
  File "/my-path/trt/trt/lib/python3.12/site-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 817, in quantize_and_export
    export_tensorrt_llm_checkpoint(
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 555, in export_tensorrt_llm_checkpoint
    raise e
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 486, in export_tensorrt_llm_checkpoint
    for (
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 152, in torch_to_tensorrt_llm_checkpoint
    vocab_size = model.config.vocab_size
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/my-path/trt/trt/lib/python3.12/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Gemma3Config' object has no attribute 'vocab_size'
```
From looking at configuration_gemma3.py, I can see that the Gemma3Config class is for the Gemma3ForConditionalGeneration architecture, but this config doesn't seem to be handled in the quantization logic (quantize_by_modelopt.py or model_config_export.py).
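A quick check along these lines illustrates the mismatch (untested beyond my environment, and it assumes access to the gated Hugging Face repo):

```python
# Quick check of the config shape; requires access to the gated Hugging Face repo.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-3-27b-it")
print(type(cfg).__name__)           # Gemma3Config (the composite VLM config)
print(hasattr(cfg, "vocab_size"))   # False here -> the AttributeError above
print(cfg.text_config.vocab_size)   # the value the exporter is actually looking for
```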
I'm assuming this is because the TensorRT Model Optimizer doesn't support Gemma3 VLM models [2]?
My ask: Is there a supported and/or recommended method to quantize the larger Gemma 3 VLM models into 4-bit precision for use with the PyTorch workflow?
[1] https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/README.md#hugging-face-supported-models
[2] https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/vlm_ptq#supported-models
@brb-nv I have the same question as above.
Issue has not received an update in over 14 days. Adding stale label.
Hi @karljang and @juney-nvidia, any update?
@geraldstanje , I discussed this with the team, and it seems we currently don’t have plans to support Gemma 4-bit quantization, especially on RTX hardware.
@karljang so in that case do we need to use Triton with the vLLM backend?
@geraldstanje, Sorry, I'm not very familiar with vLLM's quantization capabilities. To clarify, my earlier comment wasn't meant to suggest that Gemma 3 4-bit quantization is impossible. I believe it can be achieved using TensorRT Model Optimizer. We're currently prioritizing support for other models and hardware platforms due to limited time and resources. That said, contributions are always welcome!
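If someone wants to experiment, a rough sketch of the Model Optimizer PTQ flow I have in mind is below. It is untested for Gemma 3, uses the text-only 1B checkpoint for simplicity, and does not by itself address the VLM config issue reported above:

```python
# Rough, untested sketch of INT4-AWQ post-training quantization with TensorRT Model
# Optimizer. The text-only google/gemma-3-1b-it checkpoint is used for simplicity;
# the larger VLM checkpoints (Gemma3ForConditionalGeneration) would additionally
# need the multimodal wrapper handled, which is the gap discussed above.
import modelopt.torch.quantization as mtq
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # gated repo: requires Hugging Face access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda")

def calibrate(m):
    # A real calibration loop should iterate over a representative dataset;
    # a single prompt is only a placeholder.
    inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                       return_tensors="pt").to(m.device)
    with torch.no_grad():
        m(**inputs)

# Apply INT4 AWQ weight quantization; exporting a deployable checkpoint would then
# follow the Model Optimizer llm_ptq example linked in [1] above.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate)
```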
Issue has not received an update in over 14 days. Adding stale label.
This issue was closed because it has been 14 days without activity since it has been marked as stale.