
[BUG] Example of pretraining BERT does not work

Open xju2 opened this issue 1 year ago • 6 comments

Describe the bug Running the BERT pretraining example hits two issues:

  1. The error "TransformerEngine only supports softmax compute in FP32". Adding --attention-softmax-in-fp32 to the model arguments fixes it. This applies to the GPT pretraining example pretrain_gpt.sh too.
  2. The attention mask has shape [B, 1, max_seqlen, max_seqlen], but TransformerEngine's get_cu_seqlens expects [B, 1, 1, max_seqlen], so training crashes. See the log below and the sketch after this list.
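
For reference, here is a minimal standalone sketch of the cumulative-sequence-length computation that fails. It mirrors the mask reduction inside transformer_engine's get_cu_seqlens (squeeze, sum, cumsum, cat); the zero prefix tensor and the function name used here are assumptions for illustration, not TransformerEngine source verbatim.

import torch

def sketch_get_cu_seqlens(mask):
    # Reduce a padding mask to cumulative sequence lengths, mirroring TE's steps.
    mask = mask.squeeze(1).squeeze(1)             # [B, 1, 1, S] -> [B, S]; [B, 1, S, S] stays 3-D
    reduced_mask = mask.logical_not().sum(dim=1)  # unmasked tokens per sample
    cu_seqlens = reduced_mask.cumsum(dim=0).to(torch.int32)
    zero = torch.zeros(1, dtype=torch.int32, device=cu_seqlens.device)  # assumed prefix
    return torch.cat((zero, cu_seqlens))

B, S = 4, 512
ok_mask = torch.zeros(B, 1, 1, S, dtype=torch.bool)   # shape TE expects
print(sketch_get_cu_seqlens(ok_mask).shape)           # torch.Size([5])

bad_mask = torch.zeros(B, 1, S, S, dtype=torch.bool)  # shape Megatron-LM passes
try:
    sketch_get_cu_seqlens(bad_mask)
except RuntimeError as e:
    print(e)  # Tensors must have same number of dimensions: got 1 and 2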

To Reproduce Run the example ./examples/pretrain_bert.sh in the docker image nvcr.io/nvidia/pytorch:24.02-py3 with the main branch of Megatron-LM. The issue was also found in the core_r0.6.0 branch.

Expected behavior The example is expected to run out of the box.

Stack trace/logs

[after dataloaders are built] datetime: 2024-04-23 00:29:39 
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (5967.29, 5967.29)
    train/valid/test-data-iterators-setup ..........: (128.70, 128.70)
training ...
[before the start of training step] datetime: 2024-04-23 00:29:39 
torch.Size([4, 1, 512, 512])
Traceback (most recent call last):
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/pretrain_bert.py", line 194, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/training/training.py", line 270, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/training/training.py", line 990, in train
    train_step(forward_step_func,
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/training/training.py", line 541, in train_step
    losses_reduced = forward_backward_func(
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 356, in forward_backward_no_pipelining
    output_tensor = forward_step(
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 192, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/pretrain_bert.py", line 139, in forward_step
    output_tensor = model(tokens, padding_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 179, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/legacy/model/module.py", line 190, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/legacy/model/bert_model.py", line 182, in forward
    lm_output = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/legacy/model/language_model.py", line 493, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/legacy/model/transformer.py", line 1777, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/transformer.py", line 625, in forward
    self_attention_outputs = self.self_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 3461, in forward
    context_layer = self.core_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 2724, in forward
    return self.fused_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 417, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 2055, in forward
    _cu_seqlens_q = get_cu_seqlens(attention_mask)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 166, in get_cu_seqlens
    cu_seqlens = torch.cat((zero, cu_seqlens))
RuntimeError: Tensors must have same number of dimensions: got 1 and 2

Environment (please complete the following information): Used the docker image: nvcr.io/nvidia/pytorch:24.02-py3.

  • Megatron-LM commit ID: ccfeda4
  • PyTorch version: 2.3.0a0+ebedce2
  • CUDA version: 12.3
  • NCCL version: 2.20.3

Proposed fix N/A

Additional context N/A

xju2 avatar Apr 23 '24 00:04 xju2

Facing the same issue, please let me know if you have found a fix! Thanks!

abhivijay96 avatar May 28 '24 04:05 abhivijay96

In Megatron-DeepSpeed/megatron/model/bert_model.py there is a line:

extended_attention_mask = bert_extended_attention_mask(attention_mask)

where bert_extended_attention_mask is defined as:

def bert_extended_attention_mask(attention_mask):
    # We create a 3D attention mask from a 2D tensor mask.
    # [b, 1, s]
    attention_mask_b1s = attention_mask.unsqueeze(1)
    # [b, s, 1]
    attention_mask_bs1 = attention_mask.unsqueeze(2)
    # [b, s, s]
    attention_mask_bss = attention_mask_b1s * attention_mask_bs1
    # [b, 1, s, s]
    extended_attention_mask = attention_mask_bss.unsqueeze(1)

    # Convert attention mask to binary:
    extended_attention_mask = (extended_attention_mask < 0.5)

    return extended_attention_mask
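
For illustration, a quick shape check of the function above (a minimal sketch, assuming only torch and the definition above; the values are hypothetical) shows the expansion:

import torch

padding_mask = torch.ones(4, 512)                      # [b, s] mask from the data loader
extended = bert_extended_attention_mask(padding_mask)
print(extended.shape)                                  # torch.Size([4, 1, 512, 512])

This is the same [4, 1, 512, 512] shape printed in the crash log above.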

The attention_mask is extended from [b, s] to [b, 1, s, s]. Is this the cause of the problem? If so, how can I fix it?

Used the docker image nvcr.io/nvidia/pytorch:23.12-py3; Megatron-LM commit ID: c4d12e2

HenryLiu0 avatar Jun 05 '24 23:06 HenryLiu0

Using Megatron-LM branch 23.08 with the docker image nvcr.io/nvidia/pytorch:23.08-py3 avoids this problem.

HenryLiu0 avatar Jun 06 '24 08:06 HenryLiu0

Facing the same issue, please let me know if you have found a fix! Thanks!

hnyoumfk avatar Jul 15 '24 10:07 hnyoumfk

same issue here

RobinAlgayres avatar Jul 30 '24 13:07 RobinAlgayres

It turns out that transformer_engine assumes the attention mask has shape [b, 1, 1, s]; however, as pointed out by @HenryLiu0, Megatron-LM creates an extended attention mask with shape [b, 1, s, s]. To resolve this, create a text file attention.patch with the following content:

--- /usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py	2024-08-02 22:24:11.000000000 +0000
+++ attention.py	2024-09-05 17:34:58.000000000 +0000
@@ -223,6 +223,8 @@
     tensor of shape [batch_size + 1] containing the cumulative sequence lengths of
     the samples in a batch.
     """
+    if mask.shape[2] != 1:
+        mask = mask[:, :, 0, :]
     mask = mask.squeeze(1).squeeze(1)
     reduced_mask = mask.logical_not().sum(dim=1)
     cu_seqlens = reduced_mask.cumsum(dim=0).to(torch.int32)

Then run patch /usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py < attention.patch to apply it in place.
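
As a sanity check of what the patch does (a standalone sketch, not the transformer_engine source; the zero prefix tensor and helper name are assumptions): slicing out one row of the [b, 1, s, s] mask gives [b, 1, s], so the later squeeze/sum/cumsum produces a 1-D cu_seqlens again.

import torch

def patched_cu_seqlens_sketch(mask):
    if mask.shape[2] != 1:        # the two lines added by the patch
        mask = mask[:, :, 0, :]   # [B, 1, S, S] -> [B, 1, S]
    mask = mask.squeeze(1).squeeze(1)
    reduced_mask = mask.logical_not().sum(dim=1)
    cu_seqlens = reduced_mask.cumsum(dim=0).to(torch.int32)
    zero = torch.zeros(1, dtype=torch.int32, device=cu_seqlens.device)  # assumed prefix
    return torch.cat((zero, cu_seqlens))

B, S = 4, 512
mask = torch.zeros(B, 1, S, S, dtype=torch.bool)  # Megatron-LM's extended mask, all positions attended
print(patched_cu_seqlens_sketch(mask))            # tensor([0, 512, 1024, 1536, 2048], dtype=torch.int32)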

xju2 avatar Sep 05 '24 17:09 xju2