
[BUG]

Open lakshya-4gp opened this issue 2 years ago • 5 comments

Describe the bug The sequence length during training differs from the one specified in the configs. I've set `--seq-length 50016`, which is divisible by `tensor-model-parallel-size=4`; however, during multi-node training the first dimension shows up as 50341.

lm_output = self.language_model(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)

  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)

  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/model/language_model.py", line 470, in forward
    encoder_input = self.embedding(enc_input_ids, enc_position_ids,

  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)

  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)

  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/model/language_model.py", line 239, in forward
    embeddings = tensor_parallel.scatter_to_sequence_parallel_region(embeddings)

  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 342, in scatter_to_sequence_parallel_region
    return _ScatterToSequenceParallelRegion.apply(input_)

  File "/opt/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]

  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 239, in forward
    return _split_along_first_dim(input_)

  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 59, in _split_along_first_dim
    dim_size % world_size == 0

  AssertionError: First dimension of the tensor should be divisible by tensor parallel size: 50341 % 4 != 0
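The assertion comes from the sequence-parallel scatter, which splits the embedding tensor along its first (sequence) dimension across the tensor-parallel ranks. A minimal sketch of that check, using a plain Python list in place of a `torch.Tensor` (simplified from `_split_along_first_dim` in `megatron/core/tensor_parallel/mappings.py`):

```python
# Simplified stand-in for Megatron's _split_along_first_dim: the first
# dimension must divide evenly across the tensor-parallel world size.
def split_along_first_dim(seq, world_size):
    dim_size = len(seq)
    assert dim_size % world_size == 0, (
        "First dimension of the tensor should be divisible by "
        f"tensor parallel size: {dim_size} % {world_size} != 0"
    )
    shard = dim_size // world_size
    return [seq[i * shard:(i + 1) * shard] for i in range(world_size)]

tp = 4
shards = split_along_first_dim(list(range(50016)), tp)  # configured length: OK
print(len(shards), len(shards[0]))                      # 4 shards of 12504

try:
    split_along_first_dim(list(range(50341)), tp)       # observed length: fails
except AssertionError as e:
    print("AssertionError:", e)
```

So the configured 50016 would split into four shards of 12504, but the observed 50341 leaves a remainder of 1 and trips the assertion.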

To Reproduce Run `pretrain_gpt_distributed_with_mp.sh` with the following args:

```shell
GPUS_PER_NODE=8

# Change for multinode config
MASTER_ADDR=10.43.176.218
MASTER_PORT=6000
NNODES=6
NODE_RANK=$1
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=./ckpts/
VOCAB_FILE=datasets/gpt2/gpt2-vocab.json
MERGE_FILE=datasets/gpt2/gpt2-merges.txt
DATA_PATH=datasets/gpt2/my-gpt2_text_document

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NNODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size 4
    --pipeline-model-parallel-size 2
    --sequence-parallel
    --num-layers 44
    --hidden-size 1344
    --num-attention-heads 24
    --seq-length 50016
    --max-position-embeddings 50016
    --micro-batch-size 1
    --global-batch-size 12
    --lr 0.00015
    --train-iters 500000
    --lr-decay-iters 320000
    --lr-decay-style cosine
    --min-lr 1.0e-5
    --weight-decay 1e-2
    --lr-warmup-fraction .01
    --clip-grad 1.0
    --fp16
    --use-flash-attn
"
```
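As a sanity check, the parallelism layout implied by these args is internally consistent (assuming the standard Megatron decomposition, where data-parallel size = world size / (TP × PP)):

```python
# Worked arithmetic for the launch config above (standard Megatron layout
# assumed; this is an illustration, not Megatron code).
gpus_per_node, nnodes = 8, 6
tp, pp = 4, 2
micro_batch, global_batch = 1, 12
seq_length = 50016

world_size = gpus_per_node * nnodes              # 48 GPUs total
dp = world_size // (tp * pp)                     # 6 data-parallel replicas
grad_accum = global_batch // (micro_batch * dp)  # 2 accumulation steps

assert world_size % (tp * pp) == 0, "world size must be divisible by TP*PP"
assert global_batch % (micro_batch * dp) == 0, "global batch must split evenly"
assert seq_length % tp == 0, "seq-length must be divisible by TP for --sequence-parallel"
print(world_size, dp, grad_accum)  # 48 6 2
```

Everything divides cleanly, including `--seq-length 50016` by the TP size of 4, so the config itself should pass the sequence-parallel check; the 50341 must be introduced at runtime.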

Expected behavior During training the input sequence length should be 50016
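If the input batch really can arrive longer than `--seq-length`, one workaround sketch is to pad each sequence up to the next multiple of the tensor-parallel size before it reaches the embedding layer, so the sequence-parallel scatter's divisibility check passes. (`pad_to_multiple` is a hypothetical helper, not a Megatron function.)

```python
# Hypothetical workaround: pad token ids to the next multiple of the
# tensor-parallel size so _split_along_first_dim's check passes.
def pad_to_multiple(ids, multiple, pad_id=0):
    remainder = len(ids) % multiple
    if remainder:
        ids = ids + [pad_id] * (multiple - remainder)
    return ids

padded = pad_to_multiple(list(range(50341)), 4)
print(len(padded), len(padded) % 4)  # 50344 0
```

Note this only masks the symptom; the underlying question is why the dataloader produces 50341 tokens instead of the configured 50016 in the first place.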

Stack trace/logs See the stack trace above.

Environment (please complete the following information):

  • Megatron-LM commit ID: using the latest main branch
  • PyTorch version: 2.1.0
  • CUDA version: 12.2
  • NCCL version 2.17.1



lakshya-4gp avatar Mar 20 '24 23:03 lakshya-4gp

Marking as stale. No activity in 60 days.

github-actions[bot] avatar May 20 '24 18:05 github-actions[bot]

Same error here.

seanliu96 avatar May 30 '24 19:05 seanliu96

I got the same error with my own dataset and global_batch_size, with sequence-parallel enabled.

LiuLinyun avatar Jun 04 '24 15:06 LiuLinyun

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Aug 03 '24 18:08 github-actions[bot]

Same error here.

ChenQiaoling00 avatar Aug 21 '24 11:08 ChenQiaoling00

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Oct 20 '24 18:10 github-actions[bot]