[Bug]: Attention Mask Ignored in `transformer_engine` Backend with Packed Sequences (Attention Leakage)
Summary
When running pretrain_gpt.py with sequence packing enabled (--reset-position-ids and --reset-attention-mask) and the --transformer-impl transformer_engine backend, the custom block-diagonal attention mask generated by GPTDataset is effectively ignored.
The Transformer Engine (TE) layer defaults to attn_mask_type='causal', which causes it to disregard the attention_mask tensor passed during the forward pass. This results in silent attention leakage between unrelated documents within a packed sequence.
Reproduction Steps
Run pretrain_gpt.py with the following combination of flags:
python pretrain_gpt.py \
--transformer-impl transformer_engine \
--reset-position-ids \
--reset-attention-mask \
Root Cause Analysis
1. Dataset Does Not Provide cu_seqlens
GPTDataset generates a dense boolean attention_mask tensor to encode document boundaries. It does not compute or return cu_seqlens (cumulative sequence lengths), nor does it construct a PackedSeqParams object.
- Reference: megatron/core/datasets/gpt_dataset.py
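For context, here is a minimal sketch of how the missing cu_seqlens could be derived on the dataset side, assuming --reset-position-ids so each document restarts at position 0 (the helper name is hypothetical, not existing GPTDataset code):

```python
import torch

def cu_seqlens_from_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    """position_ids: 1-D tensor for one packed sample, e.g. [0, 1, 2, 0, 1, 0, 1, 2, 3]."""
    # Each document start is marked by a position id of 0.
    doc_starts = (position_ids == 0).nonzero(as_tuple=True)[0]
    total_len = torch.tensor([position_ids.numel()], device=position_ids.device)
    return torch.cat([doc_starts, total_len]).to(torch.int32)  # e.g. [0, 3, 5, 9]

# These boundaries are what TE's packed (THD) attention path consumes via
# PackedSeqParams(cu_seqlens_q=..., cu_seqlens_kv=...).
```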
2. TE Defaults to Causal Masking
The TE transformer layer is initialized with attn_mask_type='causal' (the default), regardless of whether --reset-attention-mask was passed.
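A quick way to confirm the default on a standalone TE layer (diagnostic sketch only; the exact attribute holding the mask type may differ across TE versions):

```python
import transformer_engine.pytorch as te

# Constructor arguments are the documented required ones; sizes are arbitrary.
layer = te.TransformerLayer(hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16)
print(layer.self_attn_mask_type)  # expected: 'causal' -- so attention_mask is dropped in forward
```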
3. API Contract Violation
According to the Transformer Engine documentation, the attention_mask argument in the forward pass is conditional:
Argument attention_mask in the forward call is only used when attn_mask_type includes "padding" or "arbitrary".
Because the configuration remains 'causal', TE invokes the underlying kernel (FlashAttention) with is_causal=True and no custom mask. This applies a standard lower-triangular mask over the entire packed sequence buffer (0..args.seq_length), allowing tokens in Document B to attend to tokens in Document A.
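To make the leakage concrete, here is a small pure-PyTorch illustration (sequence length and document split are made up; the convention here is True = "may attend", which may be the inverse of the dataset's tensor) comparing the mask the dataset intends with the plain lower-triangular mask a causal-only kernel applies:

```python
import torch

seq_len = 8
doc_lens = [3, 5]  # two packed documents (made-up lengths)

# Mask the dataset intends: causal *within* each document, no cross-document attention.
intended = torch.zeros(seq_len, seq_len, dtype=torch.bool)
start = 0
for n in doc_lens:
    intended[start:start + n, start:start + n] = torch.tril(torch.ones(n, n, dtype=torch.bool))
    start += n

# Mask a causal-only kernel actually applies: lower-triangular over the whole buffer.
applied = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Entries where tokens of document B can (wrongly) attend to tokens of document A.
leakage = applied & ~intended
print(leakage.any().item())        # True -> cross-document attention leakage
print(leakage.nonzero().shape[0])  # number of leaked (query, key) pairs
```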
Moreover, if --reset-position-ids is used, documents within the packed buffer carry overlapping position IDs, which further blurs the distinction between them during attention.
Impact
- Correctness: The autoregressive independence assumption is violated for packed sequences.
- Silent Failure: The model trains without error, but gradients are computed based on invalid context.
Proposed Solution
The model initialization logic needs to detect if the user has requested a custom mask (via --reset-attention-mask) and configure the TE layer accordingly.
Suggested Logic:
If args.reset_attention_mask is True, the attn_mask_type passed to te.pytorch.TransformerLayer must be forced to 'arbitrary', so that TE actually consumes the attention_mask tensor provided by the dataset. Alternatively, pretrain_gpt.py could be brought up to date with the PackedSeqParams / cu_seqlens path so that TE's native packed-sequence support is used. At minimum, an assertion on the Transformer Engine arguments would turn this silent failure into a hard error. A rough sketch of the first option is below.
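A hedged sketch of the proposed logic; the helper name and call site are placeholders, not the actual Megatron code paths:

```python
def resolve_self_attn_mask_type(args) -> str:
    # --reset-attention-mask means GPTDataset emits a custom block-diagonal mask,
    # which TE only honors for the 'padding' / 'arbitrary' mask types.
    return "arbitrary" if args.reset_attention_mask else "causal"

# At layer construction (parameter name as documented for te.pytorch.TransformerLayer;
# treat it as an assumption if your TE version differs):
# layer = te.pytorch.TransformerLayer(..., self_attn_mask_type=resolve_self_attn_mask_type(args))
```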
cc: @erhoo82 for visibility
Issue #1878 is likely a duplicate and would be resolved by this fix.
A bit more info: the position-ID resetting only matters when absolute learned positional embeddings are used, which is not the case for most modern models.