
gemma 3 architecture

Open · Alireza3242 opened this issue 10 months ago · 6 comments

Can you add the Gemma 3 architecture?

Alireza3242 · Mar 12 '25

+1

zhaocc1106 · Mar 13 '25

Would be epic; ollama and llama.cpp have implemented it already.

artur-pf · Mar 14 '25

I found that the following attention mask type is not supported yet: https://github.com/huggingface/transformers/blob/42ebb6c23e61119f769d7c7c067d5b4ae10e4a7f/src/transformers/models/gemma3/modeling_gemma3.py#L1147

    # Apply bidirectional mask on images if token type ids are provided
    if token_type_ids is not None and sequence_length != 1:
        token_type_mask = token_type_ids.unsqueeze(1) == token_type_ids.unsqueeze(2)
        token_type_mask[token_type_ids == 0] = False  # if text token do not change anything
        token_type_mask = token_type_mask.unsqueeze(1).to(causal_mask.device, dtype=torch.bool)
        causal_mask = causal_mask.clone()
        causal_mask[:, :, :, :sequence_length] = causal_mask[:, :, :, :sequence_length].masked_fill(
            token_type_mask, 0.0
        )
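
For readers following along, here is a minimal, self-contained reconstruction of what that transformers snippet does. This is not TensorRT-LLM code; the toy token_type_ids (1 = image token, 0 = text token) and the float32 mask setup are assumptions for illustration:

    import torch

    # Toy input: positions 1-3 are image tokens, positions 0 and 4 are text tokens.
    token_type_ids = torch.tensor([[0, 1, 1, 1, 0]])
    sequence_length = token_type_ids.shape[1]
    min_dtype = torch.finfo(torch.float32).min

    # Ordinary causal mask: 0 on/below the diagonal, min float ("-inf") above it.
    causal_mask = torch.triu(
        torch.full((1, 1, sequence_length, sequence_length), min_dtype), diagonal=1
    )

    # Same logic as the quoted snippet: tokens of the same image attend to each
    # other bidirectionally, while text rows (type 0) stay strictly causal.
    token_type_mask = token_type_ids.unsqueeze(1) == token_type_ids.unsqueeze(2)
    token_type_mask[token_type_ids == 0] = False
    token_type_mask = token_type_mask.unsqueeze(1).to(causal_mask.device, dtype=torch.bool)
    causal_mask = causal_mask.clone()
    causal_mask[:, :, :, :sequence_length] = causal_mask[:, :, :, :sequence_length].masked_fill(
        token_type_mask, 0.0
    )

    print(causal_mask[0, 0])  # rows 1-3 (image tokens) can now see the whole image span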

Could I set the attention_mask for the prefill stage via the Executor API? Thanks.

zhaocc1106 · Mar 14 '25

Following up on my previous comment: I'm blocked here. I added support for the Gemma 3 text LLM by adapting the Gemma 2 implementation, and output is correct when the input is text only, with no image tokens. But when the input contains an image (injected via p-tuning embeddings), the output is wrong. I eventually found that in the prefill phase the attention mask differs between text tokens and image tokens. For example:

Text-only input (a pure causal mask):

               token1  token2  token3  token4  token5
    token1       0      -inf    -inf    -inf    -inf
    token2       0       0      -inf    -inf    -inf
    token3       0       0       0      -inf    -inf
    token4       0       0       0       0      -inf
    token5       0       0       0       0       0

With image tokens (not a pure causal mask):

                   txt_token1  img_token2  img_token3  img_token4  txt_token5
    txt_token1         0          -inf        -inf        -inf        -inf
    img_token2         0           0           0           0          -inf
    img_token3         0           0           0           0          -inf
    img_token4         0           0           0           0          -inf
    txt_token5         0           0           0           0           0
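
To make the pattern above concrete, here is a tiny sketch (plain PyTorch, not TensorRT-LLM code; the single-image token layout from the table is an assumption) that derives the allowed-attention matrix directly from the token kinds:

    import torch

    # 1 marks image tokens, 0 marks text tokens, matching the table above.
    kinds = torch.tensor([0, 1, 1, 1, 0])
    n = kinds.numel()
    pos = torch.arange(n)

    # Causal rule: a query may attend to any key at or before its own position.
    causal = pos[:, None] >= pos[None, :]

    # Image rule: image tokens may additionally attend to other image tokens,
    # regardless of position (bidirectional within the single image span).
    both_image = (kinds[:, None] == 1) & (kinds[None, :] == 1)

    print((causal | both_image).int())
    # tensor([[1, 0, 0, 0, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 1]])  # 1 = attend (0 in the table), 0 = masked (-inf)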

Is there any good way to support this? Thanks very much!

zhaocc1106 · Mar 15 '25

I also found that Gemma 3 uses a sliding-window causal attention mask rather than a plain causal one, which produces wrong output for long inputs. Could we support the sliding-window causal mask? If it is already supported, is there any documentation for it? Thanks very much.

Additionally, I found that the current GPT attention layer does not support AttentionMaskType.sliding_window_causal: https://github.com/NVIDIA/TensorRT-LLM/blob/9b931c0f6305aefa3660e6fb84a76a42c0eef167/tensorrt_llm/layers/attention.py#L1001

sliding_window_causal is an important feature for several newer LLMs.
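
For reference, a sliding-window causal mask lets a query attend only to itself and the most recent keys within a fixed window. A minimal sketch (plain PyTorch, not the TensorRT-LLM implementation; the window size of 3 is arbitrary):

    import torch

    def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
        """Boolean mask: True where attention is allowed."""
        pos = torch.arange(seq_len)
        causal = pos[:, None] >= pos[None, :]          # no attending to future keys
        recent = pos[:, None] - pos[None, :] < window  # only the last `window` keys
        return causal & recent

    print(sliding_window_causal_mask(5, 3).int())
    # tensor([[1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0],
    #         [1, 1, 1, 0, 0],
    #         [0, 1, 1, 1, 0],
    #         [0, 0, 1, 1, 1]])

Gemma 3 interleaves such sliding-window layers with global-attention layers, so a plain causal mask only starts to diverge once the input grows beyond the window size.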

zhaocc1106 · Mar 17 '25

I have tried to support the Gemma 3 text LLM (https://github.com/NetEase-Media/grps_trtllm/tree/master/tools/gemma3/tensorrt_llm_mod), but there are two remaining issues: KV cache reuse is not supported (see https://github.com/NVIDIA/TensorRT-LLM/issues/2912), and image tokens cannot be processed because the image-token attention mask is not supported, as described in https://github.com/NVIDIA/TensorRT-LLM/issues/2880#issuecomment-2726181463.

zhaocc1106 · Mar 20 '25

@zhaocc1106 When you say Gemma 3 works for text only, do you mean gemma3-1b, or does text-only inference also work with the gemma3-27B model?

derektan5 · May 03 '25

Hi @derektan5, is it possible to serve Gemma 3 (google/gemma-3-4b-it) with Triton Inference Server and TensorRT-LLM using the Python backend? Do you have a small example?

geraldstanje · Aug 28 '25