[BUG] vp>1 and TORCH_NCCL_AVOID_RECORD_STREAMS=1 lead to incorrect loss behavior
Describe the bug
The loss curve of a training run with the following configuration
- virtual-pipeline-parallel-size > 1
- TORCH_NCCL_AVOID_RECORD_STREAMS=1
does not match the curves of training runs with other configurations, such as
- virtual-pipeline-parallel-size = 1
- TORCH_NCCL_AVOID_RECORD_STREAMS=0
Loss-curve plot legend:
- TORCH_NCCL_AVOID_RECORD_STREAMS=1 : the red curve
- TORCH_NCCL_AVOID_RECORD_STREAMS=0 : the yellow curve (this curve is the correct one)
To Reproduce
Run training with a script similar to the following submit.slurm script on a Slurm cluster, with or without TORCH_NCCL_AVOID_RECORD_STREAMS=1.
Skeleton script:
$ cat submit.slurm
#!/bin/bash
#SBATCH --job-name=mcore-0.12.0-vp
#SBATCH --nodes=<redacted>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --open-mode=append
set -eux -o pipefail
BASE_DIR="<redacted>"
MODEL_NAME="<redacted>"
CHECKPOINT_DIR="${BASE_DIR}/model/${MODEL_NAME}"
LOG_DIR="${BASE_DIR}/log/${MODEL_NAME}"
TENSORBOARD_DIR="${BASE_DIR}/tensorboard/${MODEL_NAME}"
DATA_PATH="<redacted>"
VALID_DATA_PATH="<redacted>"
TEST_DATA_PATH="<redacted>"
DATA_CACHE_PATH="${BASE_DIR}/data_cache"
TOKENIZER="<redacted>"
CONTAINER_TAG="<redacted>"
MAX_NUM_RETRIES=5
LOG_FILE=${LOG_DIR}/log_masternode_$(hostname)_jobid_${SLURM_JOB_ID}.txt
SRUN_SCRIPT=$(realpath "runner.slurm")
SRUN_ARGS=(
--container-image "<path/to/container/based/on/nvidia/pytorch:25.04-py3>"
--container-mounts "<redacted>"
--no-container-remap-root
--label
-K1 # -K, --kill-on-bad-exit does not tolerate spaces between name and value
)
GPT_MODEL_ARGS=(
--num-layers 16
--hidden-size 2048
--num-attention-heads 16
--seq-length 4096
--max-position-embeddings 4096
--position-embedding-type rope
--rotary-percent 1.0
--rotary-base 500000
--normalization RMSNorm
--norm-epsilon 1e-5
--disable-bias-linear
--swiglu
--group-query-attention
--num-query-groups 8
# --untie-embeddings-and-output-weights
)
TRAINING_ARGS=(
--train-iters 25000
--global-batch-size 1024
--micro-batch-size 4
--lr 2e-4
--min-lr 2e-5
--lr-decay-style cosine
--lr-warmup-fraction 0.01
--adam-beta1 0.9
--adam-beta2 0.95
--weight-decay 0.1
--clip-grad 1.0
--init-method-std 0.01
--attention-dropout 0.0
--hidden-dropout 0.1
--cross-entropy-loss-fusion
--bf16
--transformer-impl transformer_engine
--attention-backend flash
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 2
--context-parallel-size 1
--sequence-parallel
--num-layers-per-virtual-pipeline-stage 4
# --decoder-first-pipeline-num-layers 1
# --decoder-last-pipeline-num-layers 1
--overlap-grad-reduce
--overlap-param-gather
--use-distributed-optimizer
)
DATA_ARGS=(
--train-data-path ${DATA_PATH}
--valid-data-path ${VALID_DATA_PATH}
--test-data-path ${TEST_DATA_PATH}
--data-cache-path ${DATA_CACHE_PATH}
--vocab-file ${TOKENIZER}
--tokenizer-type <redacted>
--dataloader-type single
--no-create-attention-mask-in-dataloader
)
EVAL_AND_LOGGING_ARGS=(
--log-interval 10
--log-params-norm
--log-throughput
--save ${CHECKPOINT_DIR}
--load ${CHECKPOINT_DIR}
--ckpt-format torch
--eval-iters 10
--tensorboard-dir ${TENSORBOARD_DIR}
--log-timers-to-tensorboard
--log-validation-ppl-to-tensorboard
)
OPTIONS=(
"${GPT_MODEL_ARGS[@]}"
"${TRAINING_ARGS[@]}"
"${MODEL_PARALLEL_ARGS[@]}"
"${DATA_ARGS[@]}"
"${EVAL_AND_LOGGING_ARGS[@]}"
)
# For Megatron-LM
export CUDA_DEVICE_MAX_CONNECTIONS=1
# For NCCL
export NCCL_DEBUG=WARN
export NCCL_IB_TIMEOUT=22
# For OpenMP
export OMP_NUM_THREADS=14
# For pytorch.distributed
export TORCH_MASTER_ADDR=$(hostname)
export TORCH_MASTER_PORT=25091
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
mkdir -p ${LOG_DIR}
SLURM_RESTART_COUNT=${SLURM_RESTART_COUNT:-0}
EPILOGUE="exit 1"
if [ ${SLURM_RESTART_COUNT} -lt ${MAX_NUM_RETRIES} ]; then
EPILOGUE="scontrol requeue ${SLURM_JOB_ID}"
fi
(srun "${SRUN_ARGS[@]}" sh -c "bash ${SRUN_SCRIPT} ${OPTIONS[*]}" 2>&1 | tee ${LOG_FILE}) || ${EPILOGUE}
$ cat runner.slurm
#!/bin/bash
set -eux
export MASTER_ADDR=${TORCH_MASTER_ADDR}
export MASTER_PORT=${TORCH_MASTER_PORT}
echo "SLURMD_NODENAME=${SLURMD_NODENAME}"
torchrun \
--nproc-per-node=${SLURM_GPUS_ON_NODE} \
--nnodes=${SLURM_NNODES} \
--node-rank=${SLURM_NODEID} \
--master-addr=${MASTER_ADDR} \
--master-port=${MASTER_PORT} \
pretrain_gpt.py "$@"
Expected behavior
The loss curves of training runs with and without TORCH_NCCL_AVOID_RECORD_STREAMS=1 should be the same.
Stack trace/logs
The following warning message appears when TORCH_NCCL_AVOID_RECORD_STREAMS=1:
3: [rank25]:[W519 13:47:19.144266543 ProcessGroupNCCL.cpp:3648] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 is experimental for point-to-point collectives. To ensure safety, .wait() must be called on all returned handles before they fall out of scope, including for isend() calls. (function operator())
I could not find the corresponding lines in any commit of upstream PyTorch, so I assume this warning is specific to the NVIDIA PyTorch (nvidia-pytorch) build.
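For reference, the handle handling the warning asks for looks roughly like the sketch below. This is my own minimal illustration, not Megatron-LM's actual pipeline-parallel communication code, and it assumes a NCCL process group has already been initialized and that peer is a valid rank.
import torch
import torch.distributed as dist

# Minimal sketch (my own illustration, not Megatron-LM code) of the handle
# handling the warning asks for.
def p2p_exchange(send_buf: torch.Tensor, recv_buf: torch.Tensor, peer: int) -> None:
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    reqs = dist.batch_isend_irecv(ops)
    # With TORCH_NCCL_AVOID_RECORD_STREAMS=1, the warning says every returned
    # handle must be waited on before it falls out of scope -- including the
    # handle for the isend, not only the one for the irecv.
    for req in reqs:
        req.wait()
My (unverified) reading of the warning is that if a handle is dropped without .wait(), the caching allocator may reuse the send/receive buffer while NCCL is still using it, which would corrupt data silently rather than crash, and would be consistent with a wrong loss curve.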
Environment (please complete the following information):
- Megatron-LM commit ID : d580efc (= core_v0.12.0 branch)
- PyTorch version : 2.7.0a0+79aa17489c.nv25.4
- CUDA version : 12.9
- NCCL version : 2.26.3+cuda12.9
- Container : nvidia/pytorch:25.04-py3
Other context
- From the warning message, I speculate that some code path that runs only when virtual pipeline parallelism is enabled mishandles point-to-point collectives (for example, by letting a handle returned by isend/irecv fall out of scope without calling .wait() on it), but I could not pinpoint where that happens. A standalone sketch that exercises only the NCCL point-to-point path, independent of Megatron-LM, is included after this list.
- Upstream PyTorch already seems to have dropped TORCH_NCCL_AVOID_RECORD_STREAMS entirely ( https://github.com/pytorch/pytorch/pull/150398 ). This issue might eventually resolve itself once nvidia-pytorch merges those upstream changes.
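To help isolate whether the warning comes from the container's NCCL point-to-point path rather than from Megatron-LM itself, here is a hypothetical standalone check of my own (p2p_check.py is not part of the runs above). It can be launched on a node with at least 2 GPUs via TORCH_NCCL_AVOID_RECORD_STREAMS=1 torchrun --nproc-per-node=2 p2p_check.py, and I would expect it to print the same ProcessGroupNCCL warning.
$ cat p2p_check.py
# Hypothetical standalone check (not from the original report): a minimal NCCL
# point-to-point exchange that keeps and waits on all returned handles.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun provides LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    peer = (rank + 1) % dist.get_world_size()

    send_buf = torch.full((4,), float(rank), device="cuda")
    recv_buf = torch.empty(4, device="cuda")

    # Point-to-point exchange; keep every returned handle and wait on it
    # explicitly, as the warning requests.
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    torch.cuda.synchronize()

    print(f"rank {rank} received {recv_buf.tolist()} from rank {peer}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()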