
gemma 3 architecture

Open · Alireza3242 opened this issue 10 months ago · 6 comments

Can you add the Gemma 3 architecture?

Alireza3242 · Mar 12 '25

+1

zhaocc1106 · Mar 13 '25

Would be epic; ollama and llama.cpp have implemented it already.

artur-pf · Mar 14 '25

I found that the following attention mask type is not supported yet: https://github.com/huggingface/transformers/blob/42ebb6c23e61119f769d7c7c067d5b4ae10e4a7f/src/transformers/models/gemma3/modeling_gemma3.py#L1147

    # Apply bidirectional mask on images if token type ids are provided
    if token_type_ids is not None and sequence_length != 1:
        token_type_mask = token_type_ids.unsqueeze(1) == token_type_ids.unsqueeze(2)
        token_type_mask[token_type_ids == 0] = False  # if text token do not change anything
        token_type_mask = token_type_mask.unsqueeze(1).to(causal_mask.device, dtype=torch.bool)
        causal_mask = causal_mask.clone()
        causal_mask[:, :, :, :sequence_length] = causal_mask[:, :, :, :sequence_length].masked_fill(
            token_type_mask, 0.0
        )
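
For readers following along, here is a minimal, self-contained reconstruction of what that transformers snippet does. This is not TensorRT-LLM code; the toy token_type_ids (1 = image token, 0 = text token) and the float32 mask setup are assumptions for illustration:

    import torch

    # Toy input: positions 1-3 are image tokens, positions 0 and 4 are text tokens.
    token_type_ids = torch.tensor([[0, 1, 1, 1, 0]])
    sequence_length = token_type_ids.shape[1]
    min_dtype = torch.finfo(torch.float32).min

    # Ordinary causal mask: 0 on/below the diagonal, min float ("-inf") above it.
    causal_mask = torch.triu(
        torch.full((1, 1, sequence_length, sequence_length), min_dtype), diagonal=1
    )

    # Same logic as the quoted snippet: tokens of the same image attend to each
    # other bidirectionally, while text rows (type 0) stay strictly causal.
    token_type_mask = token_type_ids.unsqueeze(1) == token_type_ids.unsqueeze(2)
    token_type_mask[token_type_ids == 0] = False
    token_type_mask = token_type_mask.unsqueeze(1).to(causal_mask.device, dtype=torch.bool)
    causal_mask = causal_mask.clone()
    causal_mask[:, :, :, :sequence_length] = causal_mask[:, :, :, :sequence_length].masked_fill(
        token_type_mask, 0.0
    )

    print(causal_mask[0, 0])  # rows 1-3 (image tokens) can now see the whole image span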

Could I set the attention_mask for the prefill stage via the Executor API? Thanks.

zhaocc1106 · Mar 14 '25

Following up on my previous comment: I'm blocked here. I added support for the Gemma 3 text LLM by adapting the Gemma 2 implementation, and output is correct when the input is text only, with no image tokens. But when the input contains an image (injected via p-tuning embeddings), the output is wrong. I eventually found that in the prefill phase the attention mask differs between text tokens and image tokens. For example:

Text-only input (a pure causal mask):

               token1  token2  token3  token4  token5
    token1       0      -inf    -inf    -inf    -inf
    token2       0       0      -inf    -inf    -inf
    token3       0       0       0      -inf    -inf
    token4       0       0       0       0      -inf
    token5       0       0       0       0       0

With image tokens (not a pure causal mask):

                   txt_token1  img_token2  img_token3  img_token4  txt_token5
    txt_token1         0          -inf        -inf        -inf        -inf
    img_token2         0           0           0           0          -inf
    img_token3         0           0           0           0          -inf
    img_token4         0           0           0           0          -inf
    txt_token5         0           0           0           0           0
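
To make the pattern above concrete, here is a tiny sketch (plain PyTorch, not TensorRT-LLM code; the single-image token layout from the table is an assumption) that derives the allowed-attention matrix directly from the token kinds:

    import torch

    # 1 marks image tokens, 0 marks text tokens, matching the table above.
    kinds = torch.tensor([0, 1, 1, 1, 0])
    n = kinds.numel()
    pos = torch.arange(n)

    # Causal rule: a query may attend to any key at or before its own position.
    causal = pos[:, None] >= pos[None, :]

    # Image rule: image tokens may additionally attend to other image tokens,
    # regardless of position (bidirectional within the single image span).
    both_image = (kinds[:, None] == 1) & (kinds[None, :] == 1)

    print((causal | both_image).int())
    # tensor([[1, 0, 0, 0, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 1]])  # 1 = attend (0 in the table), 0 = masked (-inf)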

Is there any good way to support this? Thanks very much!

zhaocc1106 · Mar 15 '25

I also found that Gemma 3 uses a sliding-window causal attention mask rather than a plain causal one, which produces wrong output for long inputs. Could we support the sliding-window causal mask? If it is already supported, is there any documentation for it? Thanks very much.

Additionally, I found that the current GPT attention layer does not support AttentionMaskType.sliding_window_causal: https://github.com/NVIDIA/TensorRT-LLM/blob/9b931c0f6305aefa3660e6fb84a76a42c0eef167/tensorrt_llm/layers/attention.py#L1001

sliding_window_causal is an important feature for several newer LLMs.
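
For reference, a sliding-window causal mask lets a query attend only to itself and the most recent keys within a fixed window. A minimal sketch (plain PyTorch, not the TensorRT-LLM implementation; the window size of 3 is arbitrary):

    import torch

    def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
        """Boolean mask: True where attention is allowed."""
        pos = torch.arange(seq_len)
        causal = pos[:, None] >= pos[None, :]          # no attending to future keys
        recent = pos[:, None] - pos[None, :] < window  # only the last `window` keys
        return causal & recent

    print(sliding_window_causal_mask(5, 3).int())
    # tensor([[1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0],
    #         [1, 1, 1, 0, 0],
    #         [0, 1, 1, 1, 0],
    #         [0, 0, 1, 1, 1]])

Gemma 3 interleaves such sliding-window layers with global-attention layers, so a plain causal mask only starts to diverge once the input grows beyond the window size.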

zhaocc1106 · Mar 17 '25

I have tried to support the Gemma 3 text LLM (https://github.com/NetEase-Media/grps_trtllm/tree/master/tools/gemma3/tensorrt_llm_mod), but there are two remaining issues: KV cache reuse is not supported (see https://github.com/NVIDIA/TensorRT-LLM/issues/2912), and image tokens cannot be processed because the image-token attention mask is not supported, as described in https://github.com/NVIDIA/TensorRT-LLM/issues/2880#issuecomment-2726181463.

zhaocc1106 · Mar 20 '25

@zhaocc1106 When you say Gemma 3 works for text only, do you mean gemma3-1b, or does text-only inference also work with the gemma3-27B model?

derektan5 · May 03 '25

Hi @derektan5, is it possible to serve Gemma 3 (google/gemma-3-4b-it) with Triton Inference Server and TensorRT-LLM using the Python backend? Do you have a small example?

geraldstanje · Aug 28 '25