
When will Gemma 3 be supported?

Open bebilli opened this issue 9 months ago • 5 comments

bebilli avatar Mar 29 '25 03:03 bebilli

@bebilli

Hi bebilli,

We haven't finalized a plan to support Gemma 3 yet. If you're interested, you are welcome to contribute support for this model to TensorRT-LLM, and we can provide the necessary guidance and consulting.

June

juney-nvidia avatar Mar 29 '25 07:03 juney-nvidia

I'm just an AI application developer. Does adding Gemma 3 support require a strong, professional AI development background? If not, could you give me some guidance?

bebilli avatar Mar 29 '25 19:03 bebilli

@bebilli

Hi,

I would recommend using the PyTorch workflow to add Gemma 3 model support, as it has a gentler learning curve for AI application developers. You can follow this guide:

  • https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/adding_new_model.md

and use this example code (LLaMA) as a reference for adding Gemma 3:

  • https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_llama.py
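To give a rough sense of the shape of such a model file, here is a structural sketch in plain PyTorch. The class and method names are illustrative placeholders only, not the actual TensorRT-LLM API; the real file needs to use the PyTorch modeling API from the guide above and mirror modeling_llama.py.

```python
# Structural sketch only (plain PyTorch, placeholder names).
# A real Gemma 3 model file must use the TensorRT-LLM PyTorch modeling API
# from the guide above and mirror the structure of modeling_llama.py.
import torch
from torch import nn


class Gemma3DecoderLayerSketch(nn.Module):
    """Placeholder for one transformer block (attention + MLP + norms)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Gemma actually uses RMSNorm; LayerNorm keeps this sketch self-contained.
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size, bias=False),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size, bias=False),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # A real layer also needs attention, rotary embeddings, and KV-cache hooks.
        return hidden_states + self.mlp(self.input_layernorm(hidden_states))


class Gemma3ForCausalLMSketch(nn.Module):
    """Placeholder top-level model: embeddings -> decoder layers -> lm_head."""

    def __init__(self, vocab_size: int, hidden_size: int, num_layers: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList(
            Gemma3DecoderLayerSketch(hidden_size, 4 * hidden_size)
            for _ in range(num_layers)
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def load_weights(self, weights: dict[str, torch.Tensor]) -> None:
        # The weight-loading step maps Hugging Face checkpoint tensor names onto
        # this module hierarchy; modeling_llama.py shows the real mapping logic.
        self.load_state_dict(weights, strict=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed_tokens(input_ids)
        for layer in self.layers:
            hidden = layer(hidden)
        return self.lm_head(hidden)
```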

If you hit any specific questions while adding Gemma 3, please let us know.

Thanks, June

juney-nvidia avatar Mar 30 '25 00:03 juney-nvidia

@juney-nvidia If we use the method you mentioned, is it necessary to convert to the native TensorRT format before inference? If conversion is not required, can the performance match that of the native TensorRT format?

bebilli avatar Mar 30 '25 00:03 bebilli


For the PyTorch workflow, you don't need to convert the PyTorch model to TensorRT format. Instead, you follow the step-by-step guide to add your new model, which includes writing the model definition on top of the TensorRT-LLM PyTorch modeling API and implementing the weight-loading logic.
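To make the "no conversion" point concrete, a minimal usage sketch of the LLM API looks roughly like the following (the model path is a placeholder, and exact argument names can vary between releases, so treat it as illustrative rather than authoritative):

```python
# Minimal sketch: the PyTorch workflow loads a Hugging Face checkpoint directly;
# there is no separate TensorRT engine conversion/build step.
# The model path is a placeholder; argument names may differ between releases.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/hf_model_dir")   # HF model directory or hub ID
params = SamplingParams(max_tokens=64)

outputs = llm.generate(["What is TensorRT-LLM?"], params)
print(outputs[0].outputs[0].text)
```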

As for performance, based on our internal benchmarks on key models such as LLaMA/Mistral/Mixtral, the PyTorch workflow is on par with (or even faster than) the TensorRT workflow. This is because the same highly optimized kernels are reused in both the TensorRT workflow (as TensorRT plugins) and the PyTorch workflow (as torch custom ops), and the high-performance C++ runtime building blocks (such as the batch scheduler, KV cache manager, and disaggregated-serving logic) are also shared by both workflows.

Also, thanks to the flexibility of PyTorch, more optimizations can be added quickly to push the performance boundary further.

The recently announced world-class DeepSeek R1 performance numbers on Blackwell were all measured with the PyTorch workflow, and we currently support DeepSeek R1 only in the PyTorch workflow.

Please let me know if there are any further questions.

Thanks, June

juney-nvidia avatar Mar 30 '25 01:03 juney-nvidia

Thank you for your guidance. I'll go and give it a try.

bebilli avatar Mar 30 '25 02:03 bebilli


Thanks, looking forward to your contribution MR :)

June

juney-nvidia avatar Mar 30 '25 13:03 juney-nvidia

@bebilli any updates? :)

anonymousleek avatar Apr 07 '25 09:04 anonymousleek

@anonymousleek take a look at #3247

InCogNiTo124 avatar Apr 07 '25 19:04 InCogNiTo124

@InCogNiTo124 Does it support Gemma-3-27B?

bebilli avatar Apr 09 '25 11:04 bebilli

@bebilli are you able to check if it supports Gemma-3-27B?

derektan5 avatar Apr 26 '25 19:04 derektan5


not yet: https://github.com/NVIDIA/TensorRT-LLM/pull/3247#issuecomment-2790519367

InCogNiTo124 avatar Apr 26 '25 20:04 InCogNiTo124

I am also interested in the Gemma-3 12B Q4 variant. If I have time, I will give it a shot.

--Chris

davidcforbes avatar May 15 '25 20:05 davidcforbes

I am also interested in faster inference for the Gemma-3 27B 4-bit quantized model.

I would like to know if Gemma 3 is supported by the TensorRT-LLM engine in Python.

Thank you

Vedapani0402 avatar May 27 '25 03:05 Vedapani0402

Are there any changes on this topic?

andrelohmann avatar Jun 08 '25 15:06 andrelohmann

Also interested in Gemma-3-27b

rahchuenmonroe avatar Jun 11 '25 19:06 rahchuenmonroe

Hi, any update on Gemma 3 support?

geraldstanje1 avatar Aug 22 '25 19:08 geraldstanje1

Any update on Gemma 3 series and Gemma 3n support?

StephennFernandes avatar Aug 27 '25 19:08 StephennFernandes

Hi @juney-nvidia, is it possible to serve Gemma 3 (google/gemma-3-4b-it) with Triton Inference Server and TensorRT-LLM using the Python backend? I'm fine with text only for now...

geraldstanje avatar Aug 28 '25 09:08 geraldstanje

Gemma 3 support started in TensorRT-LLM v0.19.0 with text-only models and has been continuously enhanced with PyTorch workflow support (v0.20.0) and multimodal capabilities (v0.21.0). The support status can also be checked in the support matrix: https://nvidia.github.io/TensorRT-LLM/latest/reference/support-matrix.html

karljang avatar Sep 08 '25 19:09 karljang

Hi @karljang, what is the PyTorch workflow - do you mean the LLM API? And how can I quantize the Gemma3-4B (google/gemma-3-4b-it) model for text only and use the PyTorch workflow for inference?

geraldstanje avatar Sep 08 '25 20:09 geraldstanje

Hello @karljang and @juney-nvidia,

I understand that the PyTorch workflow supports Gemma 3 VLM models (Gemma3ForConditionalGeneration). However, like the comment above, I am also interested in running these models quantized.

In my case I'm trying to run google/gemma-3-27b-it in 4-bit precision so that it fits on my RTX 5090.

I tried downloading the main branch and running the TRT-LLM quantize script with int4_awq, since it is listed as supported for Gemma 3 in the ModelOpt LLM PTQ docs [1]. When I run it, I get an error mentioning "vocab_size" and Gemma3Config:

```
Traceback (most recent call last):
  File "/my-path/trt/quantize.py", line 160, in <module>
    quantize_and_export(
  File "/my-path/trt/trt/lib/python3.12/site-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 817, in quantize_and_export
    export_tensorrt_llm_checkpoint(
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 555, in export_tensorrt_llm_checkpoint
    raise e
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 486, in export_tensorrt_llm_checkpoint
    for (
  File "/my-path/trt/trt/lib/python3.12/site-packages/modelopt/torch/export/model_config_export.py", line 152, in torch_to_tensorrt_llm_checkpoint
    vocab_size = model.config.vocab_size
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/my-path/trt/trt/lib/python3.12/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Gemma3Config' object has no attribute 'vocab_size'
```

From looking at configuration_gemma3.py, I can see that the Gemma3Config class is for the Gemma3ForConditionalGeneration architecture, but this config doesn't seem to be handled in the quantization logic (quantize_by_modelopt.py or model_config_export.py).
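For anyone reproducing this, the mismatch is visible from the config objects alone. Illustration only; it assumes a transformers version with Gemma 3 support and access to the gated checkpoint, and the exact behavior may differ across transformers versions:

```python
# Illustration of the failure: for the multimodal checkpoints, Gemma3Config is a
# composite config, so vocab_size lives on the nested text_config rather than on
# the top-level object that the export code reads.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-3-27b-it")  # gated, needs HF access
print(type(cfg).__name__)           # Gemma3Config (composite: text_config + vision_config)
print(cfg.text_config.vocab_size)   # the value the export path expects at cfg.vocab_size

try:
    cfg.vocab_size
except AttributeError as err:
    print(err)                      # same AttributeError as in the traceback above
```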

I'm assuming this is because the TensorRT Model Optimizer doesn't support Gemma3 VLM models [2]?

My ask: Is there a supported and/or recommended method to quantize the larger Gemma 3 VLM models to 4-bit precision for use with the PyTorch workflow?

[1] https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/README.md#hugging-face-supported-models
[2] https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/vlm_ptq#supported-models

jordan-wriker avatar Sep 13 '25 08:09 jordan-wriker

@brb-nv I have the same question as above.

bebilli avatar Sep 13 '25 08:09 bebilli

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Sep 27 '25 09:09 github-actions[bot]

Hi @karljang and @juney-nvidia, any update?

geraldstanje1 avatar Sep 27 '25 19:09 geraldstanje1

@geraldstanje, I discussed this with the team, and it seems we currently don't have plans to support Gemma 3 4-bit quantization, especially on RTX hardware.

karljang avatar Oct 29 '25 23:10 karljang

@karljang so in that case do we need to use Triton with the vLLM backend?

geraldstanje1 avatar Oct 30 '25 02:10 geraldstanje1

@geraldstanje, sorry, I'm not very familiar with vLLM's quantization capabilities. To clarify, my earlier comment wasn't meant to suggest that Gemma 3 4-bit quantization is impossible; I believe it can be achieved using TensorRT Model Optimizer. We're currently prioritizing support for other models and hardware platforms due to limited time and resources. That said, contributions are always welcome!
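For anyone who wants to attempt it, a rough and unverified sketch of ModelOpt INT4-AWQ post-training quantization on a text-only Gemma 3 checkpoint might look like the following. The config name and calibration loop follow the ModelOpt PTQ examples, and the larger multimodal checkpoints still hit the composite-config export issue discussed above:

```python
# Rough, unverified sketch: ModelOpt INT4-AWQ post-training quantization on a
# text-only Gemma 3 checkpoint. mtq.INT4_AWQ_CFG and mtq.quantize(...) follow
# the ModelOpt PTQ examples; the 27B multimodal checkpoint additionally needs
# the Gemma3Config/vocab_size export issue above to be resolved.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # text-only variant; gated, needs HF access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def calibrate(m):
    # A real calibration loop should feed a few hundred representative samples.
    batch = tokenizer("TensorRT-LLM calibration sample.", return_tensors="pt")
    with torch.no_grad():
        m(**batch)

# Quantize weights to INT4 with AWQ calibration.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate)
# Exporting a checkpoint that TensorRT-LLM can consume is the step that
# currently trips over Gemma3Config for the multimodal variants.
```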

karljang avatar Oct 30 '25 04:10 karljang

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Nov 14 '25 03:11 github-actions[bot]

This issue was closed because it has been 14 days without activity since it has been marked as stale.

github-actions[bot] avatar Nov 29 '25 03:11 github-actions[bot]