[BUG] Training sequence length differs from --seq-length when sequence parallelism is enabled
Describe the bug The sequence length during training differs from the one specified in the configs. I set `--seq-length 50016`, which is divisible by `--tensor-model-parallel-size 4`, yet during multi-node training the sequence dimension shows up as 50341:
```
    lm_output = self.language_model(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/model/language_model.py", line 470, in forward
    encoder_input = self.embedding(enc_input_ids, enc_position_ids,
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/model/language_model.py", line 239, in forward
    embeddings = tensor_parallel.scatter_to_sequence_parallel_region(embeddings)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 342, in scatter_to_sequence_parallel_region
    return _ScatterToSequenceParallelRegion.apply(input_)
  File "/opt/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 239, in forward
    return _split_along_first_dim(input_)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 59, in _split_along_first_dim
    dim_size % world_size == 0
AssertionError: First dimension of the tensor should be divisible by tensor parallel size: 50341 % 4 != 0
```
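For reference, the failing check is in the sequence-parallel scatter, which shards the embedding activation along the first (sequence) dimension across tensor-parallel ranks. Below is a minimal sketch approximating `_split_along_first_dim` from `megatron/core/tensor_parallel/mappings.py`; the standalone helper signature and the `[seq, batch, hidden]` tensor shape are illustrative, not the library's exact code:

```python
import torch

def split_along_first_dim(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Shard a [seq, batch, hidden] activation along dim 0 across TP ranks."""
    dim_size = x.size(0)
    # Sequence parallelism splits activations along the sequence dimension,
    # so the sequence length must be a multiple of the tensor-parallel size.
    assert dim_size % world_size == 0, (
        f"First dimension of the tensor should be divisible by tensor "
        f"parallel size: {dim_size} % {world_size} != 0"
    )
    chunk = dim_size // world_size
    return x[rank * chunk:(rank + 1) * chunk].contiguous()

# The configured 50016 % 4 == 0 would pass, but the embedding reaches the
# scatter with sequence dimension 50341, which trips the assertion:
x = torch.empty(50341, 1, 1344)
split_along_first_dim(x, world_size=4, rank=0)  # AssertionError: 50341 % 4 != 0
```

The assertion itself behaves as intended; the open question is where the extra 325 positions (50341 - 50016) are appended before the scatter.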
To Reproduce Run `pretrain_gpt_distributed_with_mp.sh` with the following args:
```bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=10.43.176.218
MASTER_PORT=6000
NNODES=6
NODE_RANK=$1
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=./ckpts/
VOCAB_FILE=datasets/gpt2/gpt2-vocab.json
MERGE_FILE=datasets/gpt2/gpt2-merges.txt
DATA_PATH=datasets/gpt2/my-gpt2_text_document
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"
GPT_ARGS="
--tensor-model-parallel-size 4
--pipeline-model-parallel-size 2
--sequence-parallel
--num-layers 44
--hidden-size 1344
--num-attention-heads 24
--seq-length 50016
--max-position-embeddings 50016
--micro-batch-size 1
--global-batch-size 12
--lr 0.00015
--train-iters 500000
--lr-decay-iters 320000
--lr-decay-style cosine
--min-lr 1.0e-5
--weight-decay 1e-2
--lr-warmup-fraction .01
--clip-grad 1.0
--fp16
--use-flash-attn
"`
Expected behavior During training, the input sequence length should be 50016.
Stack trace/logs See the traceback included above.
Environment (please complete the following information):
- Megatron-LM commit ID: using the latest main branch
- PyTorch version: 2.1.0
- CUDA version: 12.2
- NCCL version: 2.17.1
Same error here.
I've got the same error, with the specified dataset, global_batch_size, and sequence_parallel on.