Gemma 3 architecture
Can you add the Gemma 3 architecture?
+1
Would be epic; ollama and llama.cpp have already implemented it.
I found that the following attention mask type is not supported yet: https://github.com/huggingface/transformers/blob/42ebb6c23e61119f769d7c7c067d5b4ae10e4a7f/src/transformers/models/gemma3/modeling_gemma3.py#L1147
```python
# Apply bidirectional mask on images if token type ids are provided
if token_type_ids is not None and sequence_length != 1:
    token_type_mask = token_type_ids.unsqueeze(1) == token_type_ids.unsqueeze(2)
    token_type_mask[token_type_ids == 0] = False  # if text token do not change anything
    token_type_mask = token_type_mask.unsqueeze(1).to(causal_mask.device, dtype=torch.bool)
    causal_mask = causal_mask.clone()
    causal_mask[:, :, :, :sequence_length] = causal_mask[:, :, :, :sequence_length].masked_fill(
        token_type_mask, 0.0
    )
```
Could I set the attention_mask for the prefill stage with the Executor API? Thanks.
I'm blocked here. I have added support for the gemma3 text LLM by referring to gemma2, and the output is correct when the input is text only, without image tokens. But when the input contains an image (injected via a p-tuning embedding), the output is wrong. I finally found that the attention mask differs between text tokens and image tokens in the prefill phase. For example:
Text tokens only (a causal mask):
| | token1 | token2 | token3 | token4 | token5 |
|---|---|---|---|---|---|
| token1 | 0 | -inf | -inf | -inf | -inf |
| token2 | 0 | 0 | -inf | -inf | -inf |
| token3 | 0 | 0 | 0 | -inf | -inf |
| token4 | 0 | 0 | 0 | 0 | -inf |
| token5 | 0 | 0 | 0 | 0 | 0 |
With image tokens (not a pure causal mask):
| | txt_token1 | img_token2 | img_token3 | img_token4 | txt_token5 |
|---|---|---|---|---|---|
| txt_token1 | 0 | -inf | -inf | -inf | -inf |
| img_token2 | 0 | 0 | 0 | 0 | -inf |
| img_token3 | 0 | 0 | 0 | 0 | -inf |
| img_token4 | 0 | 0 | 0 | 0 | -inf |
| txt_token5 | 0 | 0 | 0 | 0 | 0 |
Is there any good method to support this? Thanks very much!
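For reference, a minimal PyTorch sketch of how such a mask could be built from `token_type_ids`, assuming 0 marks text tokens and 1 marks image tokens (matching the table above). This is illustrative only, not the TensorRT-LLM Executor API or the transformers implementation:

```python
import torch


def build_gemma3_prefill_mask(token_type_ids: torch.Tensor) -> torch.Tensor:
    """Additive attention mask [batch, seq, seq]: 0 = attend, -inf = masked."""
    batch, seq_len = token_type_ids.shape
    neg_inf = torch.finfo(torch.float32).min

    # Standard causal mask: 0 on and below the diagonal, -inf above it.
    causal = torch.triu(torch.full((seq_len, seq_len), neg_inf), diagonal=1)
    mask = causal.unsqueeze(0).expand(batch, -1, -1).clone()

    # Pairs of image tokens may attend to each other in both directions;
    # rows with a text query keep the causal pattern unchanged.
    same_type = token_type_ids.unsqueeze(1) == token_type_ids.unsqueeze(2)
    image_query = (token_type_ids == 1).unsqueeze(-1)
    return mask.masked_fill(same_type & image_query, 0.0)


# Reproduces the table above: [text, image, image, image, text]
ids = torch.tensor([[0, 1, 1, 1, 0]])
print(build_gemma3_prefill_mask(ids)[0])
```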
I also found that gemma3 uses a sliding-window causal attention mask rather than a plain causal one, which results in wrong output for long inputs. Could we support the sliding-window causal mask? If it is supported, is there a description of it? Thanks very much.
Additionally, I found that the current GPT attention does not support the AttentionMaskType.sliding_window_causal type:
https://github.com/NVIDIA/TensorRT-LLM/blob/9b931c0f6305aefa3660e6fb84a76a42c0eef167/tensorrt_llm/layers/attention.py#L1001
sliding_window_causal is an important feature for some new LLMs.
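For illustration, a minimal PyTorch sketch of the mask shape this refers to, assuming a window of W means a token attends to itself and the previous W-1 tokens; this only shows the mask pattern and is not the TensorRT-LLM kernel implementation:

```python
import torch


def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Additive mask [seq, seq]: 0 = attend, -inf = masked."""
    neg_inf = torch.finfo(torch.float32).min
    q = torch.arange(seq_len).unsqueeze(1)  # query positions [seq, 1]
    k = torch.arange(seq_len).unsqueeze(0)  # key positions   [1, seq]
    # A query may attend to keys that are not in the future and that lie
    # within the last `window` positions relative to the query.
    allowed = (k <= q) & (k > q - window)
    return torch.zeros(seq_len, seq_len).masked_fill(~allowed, neg_inf)


# With window=3, token 4 still sees tokens 2-4 but no longer tokens 0-1,
# unlike a plain causal mask.
print(sliding_window_causal_mask(5, 3))
```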
I have tried to support the gemma3 text LLM (https://github.com/NetEase-Media/grps_trtllm/tree/master/tools/gemma3/tensorrt_llm_mod). But there is one issue: KV cache reuse is not supported, see https://github.com/NVIDIA/TensorRT-LLM/issues/2912. Additionally, it cannot process image tokens because the image-token attention mask is not supported, as described in https://github.com/NVIDIA/TensorRT-LLM/issues/2880#issuecomment-2726181463.
@zhaocc1106 When you say gemma3 is working for text only, are you talking about gemma3-1b, or does it also work for the gemma3-27B model for text-only inference?
Hi @derektan5, is it possible to serve gemma3 (google/gemma-3-4b-it) with Triton Inference Server using the TensorRT-LLM Python backend? Do you have a small example?