[BUG] Training sequence length differs from --seq-length when sequence parallelism is enabled
Describe the bug The sequence length during training differs from the one specified in the configs. I set `--seq-length 50016`, which is divisible by `--tensor-model-parallel-size 4`, yet during multi-node training the sequence dimension shows up as 50341:
```
    lm_output = self.language_model(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/model/language_model.py", line 470, in forward
    encoder_input = self.embedding(enc_input_ids, enc_position_ids,
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/model/language_model.py", line 239, in forward
    embeddings = tensor_parallel.scatter_to_sequence_parallel_region(embeddings)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 342, in scatter_to_sequence_parallel_region
    return _ScatterToSequenceParallelRegion.apply(input_)
  File "/opt/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 239, in forward
    return _split_along_first_dim(input_)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 59, in _split_along_first_dim
    dim_size % world_size == 0
AssertionError: First dimension of the tensor should be divisible by tensor parallel size: 50341 % 4 != 0
```
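For reference, the failing check is in the sequence-parallel scatter, which shards the embedding activation along the first (sequence) dimension across tensor-parallel ranks. Below is a minimal sketch approximating `_split_along_first_dim` from `megatron/core/tensor_parallel/mappings.py`; the standalone helper signature and the `[seq, batch, hidden]` tensor shape are illustrative, not the library's exact code:

```python
import torch

def split_along_first_dim(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Shard a [seq, batch, hidden] activation along dim 0 across TP ranks."""
    dim_size = x.size(0)
    # Sequence parallelism splits activations along the sequence dimension,
    # so the sequence length must be a multiple of the tensor-parallel size.
    assert dim_size % world_size == 0, (
        f"First dimension of the tensor should be divisible by tensor "
        f"parallel size: {dim_size} % {world_size} != 0"
    )
    chunk = dim_size // world_size
    return x[rank * chunk:(rank + 1) * chunk].contiguous()

# The configured 50016 % 4 == 0 would pass, but the embedding reaches the
# scatter with sequence dimension 50341, which trips the assertion:
x = torch.empty(50341, 1, 1344)
split_along_first_dim(x, world_size=4, rank=0)  # AssertionError: 50341 % 4 != 0
```

The assertion itself behaves as intended; the open question is where the extra 325 positions (50341 - 50016) are appended before the scatter.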
To Reproduce Run `pretrain_gpt_distributed_with_mp.sh` with the following args:
```bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=10.43.176.218
MASTER_PORT=6000
NNODES=6
NODE_RANK=$1
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=./ckpts/
VOCAB_FILE=datasets/gpt2/gpt2-vocab.json
MERGE_FILE=datasets/gpt2/gpt2-merges.txt
DATA_PATH=datasets/gpt2/my-gpt2_text_document
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size 4
--pipeline-model-parallel-size 2
--sequence-parallel
--num-layers 44
--hidden-size 1344
--num-attention-heads 24
--seq-length 50016
--max-position-embeddings 50016
--micro-batch-size 1
--global-batch-size 12
--lr 0.00015
--train-iters 500000
--lr-decay-iters 320000
--lr-decay-style cosine
--min-lr 1.0e-5
--weight-decay 1e-2
--lr-warmup-fraction .01
--clip-grad 1.0
--fp16
--use-flash-attn
"`
Expected behavior During training, the input sequence length should be 50016.
Stack trace/logs See the traceback included above.
Environment (please complete the following information):
- Megatron-LM commit ID: using the latest main branch
- PyTorch version: 2.1.0
- CUDA version: 12.2
- NCCL version: 2.17.1
Same error here.
I've got the same error, with the specified dataset, global_batch_size, and sequence_parallel on.