
When will Gemma 3 be supported?

Open bebilli opened this issue 9 months ago • 5 comments

bebilli avatar Mar 29 '25 03:03 bebilli

@bebilli

Hi bebilli,

We haven't finalized a plan to support Gemma 3 yet. If you're interested, you are welcome to contribute support for this model to TensorRT-LLM, and we can provide the necessary guidance and consulting.

June

juney-nvidia avatar Mar 29 '25 07:03 juney-nvidia

I'm just an AI application developer. Does adding Gemma 3 support require a strong, professional AI development background? If not, could you give me some guidance?

bebilli avatar Mar 29 '25 19:03 bebilli

@bebilli

Hi,

I would recommend using the PyTorch workflow to add Gemma 3 model support, as it has a gentler learning curve for AI application developers. You can follow this guide:

  • https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/adding_new_model.md

and use this example code (LLaMA) as a reference for adding Gemma 3:

  • https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_llama.py
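To give a rough sense of the shape of such a model file, here is a structural sketch in plain PyTorch. The class and method names are illustrative placeholders only, not the actual TensorRT-LLM API; the real file needs to use the PyTorch modeling API from the guide above and mirror modeling_llama.py.

```python
# Structural sketch only (plain PyTorch, placeholder names).
# A real Gemma 3 model file must use the TensorRT-LLM PyTorch modeling API
# from the guide above and mirror the structure of modeling_llama.py.
import torch
from torch import nn


class Gemma3DecoderLayerSketch(nn.Module):
    """Placeholder for one transformer block (attention + MLP + norms)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Gemma actually uses RMSNorm; LayerNorm keeps this sketch self-contained.
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size, bias=False),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size, bias=False),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # A real layer also needs attention, rotary embeddings, and KV-cache hooks.
        return hidden_states + self.mlp(self.input_layernorm(hidden_states))


class Gemma3ForCausalLMSketch(nn.Module):
    """Placeholder top-level model: embeddings -> decoder layers -> lm_head."""

    def __init__(self, vocab_size: int, hidden_size: int, num_layers: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList(
            Gemma3DecoderLayerSketch(hidden_size, 4 * hidden_size)
            for _ in range(num_layers)
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def load_weights(self, weights: dict[str, torch.Tensor]) -> None:
        # The weight-loading step maps Hugging Face checkpoint tensor names onto
        # this module hierarchy; modeling_llama.py shows the real mapping logic.
        self.load_state_dict(weights, strict=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed_tokens(input_ids)
        for layer in self.layers:
            hidden = layer(hidden)
        return self.lm_head(hidden)
```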

If you hit any specific questions while adding Gemma 3, please let us know.

Thanks, June

juney-nvidia avatar Mar 30 '25 00:03 juney-nvidia

@juney-nvidia If we use the method you mentioned, is it necessary to convert to the native TensorRT format before inference? If conversion is not required, can the performance match that of the native TensorRT format?

bebilli avatar Mar 30 '25 00:03 bebilli


For the PyTorch workflow, you don't need to convert the PyTorch model to TensorRT format. Instead, you follow the step-by-step guide to add your new model, which includes writing the model definition on top of the TensorRT-LLM PyTorch modeling API and implementing the weight-loading logic.
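To make the "no conversion" point concrete, a minimal usage sketch of the LLM API looks roughly like the following (the model path is a placeholder, and exact argument names can vary between releases, so treat it as illustrative rather than authoritative):

```python
# Minimal sketch: the PyTorch workflow loads a Hugging Face checkpoint directly;
# there is no separate TensorRT engine conversion/build step.
# The model path is a placeholder; argument names may differ between releases.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/hf_model_dir")   # HF model directory or hub ID
params = SamplingParams(max_tokens=64)

outputs = llm.generate(["What is TensorRT-LLM?"], params)
print(outputs[0].outputs[0].text)
```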

As for performance, based on our internal benchmarks on key models such as LLaMA/Mistral/Mixtral, the PyTorch workflow is on par with (or even faster than) the TensorRT workflow. This is because the same highly optimized kernels are reused in both the TensorRT workflow (as TensorRT plugins) and the PyTorch workflow (as torch custom ops), and the high-performance C++ runtime building blocks (such as the batch scheduler, KV cache manager, and disaggregated-serving logic) are also shared by both workflows.

Also, thanks to the flexibility of PyTorch, more optimizations can be added quickly to push the performance boundary further.

The recently announced world-class DeepSeek R1 performance numbers on Blackwell were all measured with the PyTorch workflow, and we currently support DeepSeek R1 only in the PyTorch workflow.

Please let me know if there are any further questions.

Thanks, June

juney-nvidia avatar Mar 30 '25 01:03 juney-nvidia

Thank you for your guidance. I'll go and give it a try.

bebilli avatar Mar 30 '25 02:03 bebilli


Thanks, looking forward to your contribution MR :)

June

juney-nvidia avatar Mar 30 '25 13:03 juney-nvidia

@bebilli any updates? :)

anonymousleek avatar Apr 07 '25 09:04 anonymousleek

@anonymousleek take a look at #3247

InCogNiTo124 avatar Apr 07 '25 19:04 InCogNiTo124

@InCogNiTo124 Does it support Gemma-3-27B?

bebilli avatar Apr 09 '25 11:04 bebilli

@bebilli are you able to check if it supports Gemma-3-27B?

derektan5 avatar Apr 26 '25 19:04 derektan5


not yet: https://github.com/NVIDIA/TensorRT-LLM/pull/3247#issuecomment-2790519367

InCogNiTo124 avatar Apr 26 '25 20:04 InCogNiTo124

I am also interested in the Gemma-3 12B Q4 variant. If I have time, I will give it a shot.

--Chris

davidcforbes avatar May 15 '25 20:05 davidcforbes

I am also interested in faster inference for the Gemma-3 27B 4-bit quantized model.

I would like to know if Gemma 3 is supported by the TensorRT-LLM engine in Python.

Thank you

Vedapani0402 avatar May 27 '25 03:05 Vedapani0402

Are there any changes on this topic?

andrelohmann avatar Jun 08 '25 15:06 andrelohmann

Also interested in Gemma-3-27b

rahchuenmonroe avatar Jun 11 '25 19:06 rahchuenmonroe

Hi, any update on Gemma 3 support?

geraldstanje1 avatar Aug 22 '25 19:08 geraldstanje1

Any update on Gemma 3 series and Gemma 3n support?

StephennFernandes avatar Aug 27 '25 19:08 StephennFernandes

Hi @juney-nvidia, is it possible to serve Gemma 3 (google/gemma-3-4b-it) with Triton Inference Server and TensorRT-LLM using the Python backend? I'm fine with text only for now...

geraldstanje avatar Aug 28 '25 09:08 geraldstanje

Gemma 3 support started in TensorRT-LLM v0.19.0 with text-only models and has been continuously enhanced with PyTorch workflow support (v0.20.0) and multimodal capabilities (v0.21.0). The support status can also be checked in the support matrix: https://nvidia.github.io/TensorRT-LLM/latest/reference/support-matrix.html

karljang avatar Sep 08 '25 19:09 karljang

Hi @karljang, what is the PyTorch workflow - do you mean the LLM API? And how can I quantize the Gemma3-4B (google/gemma-3-4b-it) model for text only and use the PyTorch workflow for inference?

geraldstanje avatar Sep 08 '25 20:09 geraldstanje

Hello @karljang and @juney-nvidia,

I understand that the PyTorch workflow supports Gemma 3 VLM models (Gemma3ForConditionalGeneration). However, like the comment above, I am also interested in running these models quantized.

In my case I'm trying to run google/gemma-3-27b-it in 4-bit precision so that it fits on my RTX 5090.

I tried downloading the main branch and running the TRT-LLM quantize script with int4_awq, since it is listed as supported for Gemma 3 in the ModelOpt LLM PTQ docs [1]. When I run it, I get an error mentioning "vocab_size" and Gemma3Config:

```
Traceback (most recent call last):
  File "/my-path/trt/quantize.py", line 160, in <module>
    quantize_and_export(
  File "/my-path/trt/trt/lib/python3.12/site-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 817, in quantize_and_export
    export_tensorrt_llm_checkpoint(
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 555, in export_tensorrt_llm_checkpoint
    raise e
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 486, in export_tensorrt_llm_checkpoint
    for (
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 152, in torch_to_tensorrt_llm_checkpoint
    vocab_size = model.config.vocab_size
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/my-path/trt/trt/lib/python3.12/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Gemma3Config' object has no attribute 'vocab_size'
```

From looking at configuration_gemma3.py, I can see that the Gemma3Config class is for the Gemma3ForConditionalGeneration architecture, but this config doesn't seem to be handled in the quantization logic (quantize_by_modelopt.py or model_config_export.py).
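For anyone reproducing this, the mismatch is visible from the config objects alone. Illustration only; it assumes a transformers version with Gemma 3 support and access to the gated checkpoint, and the exact behavior may differ across transformers versions:

```python
# Illustration of the failure: for the multimodal checkpoints, Gemma3Config is a
# composite config, so vocab_size lives on the nested text_config rather than on
# the top-level object that the export code reads.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-3-27b-it")  # gated, needs HF access
print(type(cfg).__name__)           # Gemma3Config (composite: text_config + vision_config)
print(cfg.text_config.vocab_size)   # the value the export path expects at cfg.vocab_size

try:
    cfg.vocab_size
except AttributeError as err:
    print(err)                      # same AttributeError as in the traceback above
```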

I'm assuming this is because the TensorRT Model Optimizer doesn't support Gemma3 VLM models [2]?

My ask: Is there a supported and/or recommended method to quantize the larger Gemma 3 VLM models to 4-bit precision for use with the PyTorch workflow?

[1] https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/README.md#hugging-face-supported-models
[2] https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/vlm_ptq#supported-models

jordan-wriker avatar Sep 13 '25 08:09 jordan-wriker

@brb-nv I have the same question as above.

bebilli avatar Sep 13 '25 08:09 bebilli

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Sep 27 '25 09:09 github-actions[bot]

Hi @karljang and @juney-nvidia, any update?

geraldstanje1 avatar Sep 27 '25 19:09 geraldstanje1

@geraldstanje, I discussed this with the team, and it seems we currently don't have plans to support Gemma 3 4-bit quantization, especially on RTX hardware.

karljang avatar Oct 29 '25 23:10 karljang

@karljang so in that case do we need to use Triton with the vLLM backend?

geraldstanje1 avatar Oct 30 '25 02:10 geraldstanje1

@geraldstanje, sorry, I'm not very familiar with vLLM's quantization capabilities. To clarify, my earlier comment wasn't meant to suggest that Gemma 3 4-bit quantization is impossible; I believe it can be achieved using TensorRT Model Optimizer. We're currently prioritizing support for other models and hardware platforms due to limited time and resources. That said, contributions are always welcome!
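For anyone who wants to attempt it, a rough and unverified sketch of ModelOpt INT4-AWQ post-training quantization on a text-only Gemma 3 checkpoint might look like the following. The config name and calibration loop follow the ModelOpt PTQ examples, and the larger multimodal checkpoints still hit the composite-config export issue discussed above:

```python
# Rough, unverified sketch: ModelOpt INT4-AWQ post-training quantization on a
# text-only Gemma 3 checkpoint. mtq.INT4_AWQ_CFG and mtq.quantize(...) follow
# the ModelOpt PTQ examples; the 27B multimodal checkpoint additionally needs
# the Gemma3Config/vocab_size export issue above to be resolved.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # text-only variant; gated, needs HF access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def calibrate(m):
    # A real calibration loop should feed a few hundred representative samples.
    batch = tokenizer("TensorRT-LLM calibration sample.", return_tensors="pt")
    with torch.no_grad():
        m(**batch)

# Quantize weights to INT4 with AWQ calibration.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate)
# Exporting a checkpoint that TensorRT-LLM can consume is the step that
# currently trips over Gemma3Config for the multimodal variants.
```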

karljang avatar Oct 30 '25 04:10 karljang

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Nov 14 '25 03:11 github-actions[bot]

This issue was closed because it has been 14 days without activity since it has been marked as stale.

github-actions[bot] avatar Nov 29 '25 03:11 github-actions[bot]