[BUG] vp>1 and TORCH_NCCL_AVOID_RECORD_STREAMS=1 lead to incorrect loss behavior
Describe the bug
The loss curve of a training run with the following configuration
- virtual-pipeline-parallel-size > 1
- TORCH_NCCL_AVOID_RECORD_STREAMS=1
does not match the curves of training runs with other configurations, such as
- virtual-pipeline-parallel-size = 1
- TORCH_NCCL_AVOID_RECORD_STREAMS=0
Loss-curve plot legend:
- TORCH_NCCL_AVOID_RECORD_STREAMS=1 : the red curve
- TORCH_NCCL_AVOID_RECORD_STREAMS=0 : the yellow curve (this curve is the correct one)
To Reproduce
Run training with a script similar to the following submit.slurm script on a Slurm cluster, with or without TORCH_NCCL_AVOID_RECORD_STREAMS=1.
Skeleton script:
$ cat submit.slurm
#!/bin/bash
#SBATCH --job-name=mcore-0.12.0-vp
#SBATCH --nodes=<redacted>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --open-mode=append
set -eux -o pipefail
BASE_DIR="<redacted>"
MODEL_NAME="<redacted>"
CHECKPOINT_DIR="${BASE_DIR}/model/${MODEL_NAME}"
LOG_DIR="${BASE_DIR}/log/${MODEL_NAME}"
TENSORBOARD_DIR="${BASE_DIR}/tensorboard/${MODEL_NAME}"
DATA_PATH="<redacted>"
VALID_DATA_PATH="<redacted>"
TEST_DATA_PATH="<redacted>"
DATA_CACHE_PATH="${BASE_DIR}/data_cache"
TOKENIZER="<redacted>"
CONTAINER_TAG="<redacted>"
MAX_NUM_RETRIES=5
LOG_FILE=${LOG_DIR}/log_masternode_$(hostname)_jobid_${SLURM_JOB_ID}.txt
SRUN_SCRIPT=$(realpath "runner.slurm")
SRUN_ARGS=(
--container-image "<path/to/container/based/on/nvidia/pytorch:25.04-py3>"
--container-mounts "<redacted>"
--no-container-remap-root
--label
-K1 # -K, --kill-on-bad-exit does not tolerate spaces between name and value
)
GPT_MODEL_ARGS=(
--num-layers 16
--hidden-size 2048
--num-attention-heads 16
--seq-length 4096
--max-position-embeddings 4096
--position-embedding-type rope
--rotary-percent 1.0
--rotary-base 500000
--normalization RMSNorm
--norm-epsilon 1e-5
--disable-bias-linear
--swiglu
--group-query-attention
--num-query-groups 8
# --untie-embeddings-and-output-weights
)
TRAINING_ARGS=(
--train-iters 25000
--global-batch-size 1024
--micro-batch-size 4
--lr 2e-4
--min-lr 2e-5
--lr-decay-style cosine
--lr-warmup-fraction 0.01
--adam-beta1 0.9
--adam-beta2 0.95
--weight-decay 0.1
--clip-grad 1.0
--init-method-std 0.01
--attention-dropout 0.0
--hidden-dropout 0.1
--cross-entropy-loss-fusion
--bf16
--transformer-impl transformer_engine
--attention-backend flash
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 2
--context-parallel-size 1
--sequence-parallel
--num-layers-per-virtual-pipeline-stage 4
# --decoder-first-pipeline-num-layers 1
# --decoder-last-pipeline-num-layers 1
--overlap-grad-reduce
--overlap-param-gather
--use-distributed-optimizer
)
DATA_ARGS=(
--train-data-path ${DATA_PATH}
--valid-data-path ${VALID_DATA_PATH}
--test-data-path ${TEST_DATA_PATH}
--data-cache-path ${DATA_CACHE_PATH}
--vocab-file ${TOKENIZER}
--tokenizer-type <redacted>
--dataloader-type single
--no-create-attention-mask-in-dataloader
)
EVAL_AND_LOGGING_ARGS=(
--log-interval 10
--log-params-norm
--log-throughput
--save ${CHECKPOINT_DIR}
--load ${CHECKPOINT_DIR}
--ckpt-format torch
--eval-iters 10
--tensorboard-dir ${TENSORBOARD_DIR}
--log-timers-to-tensorboard
--log-validation-ppl-to-tensorboard
)
OPTIONS=(
"${GPT_MODEL_ARGS[@]}"
"${TRAINING_ARGS[@]}"
"${MODEL_PARALLEL_ARGS[@]}"
"${DATA_ARGS[@]}"
"${EVAL_AND_LOGGING_ARGS[@]}"
)
# For Megatron-LM
export CUDA_DEVICE_MAX_CONNECTIONS=1
# For NCCL
export NCCL_DEBUG=WARN
export NCCL_IB_TIMEOUT=22
# For OpenMP
export OMP_NUM_THREADS=14
# For pytorch.distributed
export TORCH_MASTER_ADDR=$(hostname)
export TORCH_MASTER_PORT=25091
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
mkdir -p ${LOG_DIR}
SLURM_RESTART_COUNT=${SLURM_RESTART_COUNT:-0}
EPILOGUE="exit 1"
if [ ${SLURM_RESTART_COUNT} -lt ${MAX_NUM_RETRIES} ]; then
EPILOGUE="scontrol requeue ${SLURM_JOB_ID}"
fi
(srun "${SRUN_ARGS[@]}" sh -c "bash ${SRUN_SCRIPT} ${OPTIONS[*]}" 2>&1 | tee ${LOG_FILE}) || ${EPILOGUE}
$ cat runner.slurm
#!/bin/bash
set -eux
export MASTER_ADDR=${TORCH_MASTER_ADDR}
export MASTER_PORT=${TORCH_MASTER_PORT}
echo "SLURMD_NODENAME=${SLURMD_NODENAME}"
torchrun \
--nproc-per-node=${SLURM_GPUS_ON_NODE} \
--nnodes=${SLURM_NNODES} \
--node-rank=${SLURM_NODEID} \
--master-addr=${MASTER_ADDR} \
--master-port=${MASTER_PORT} \
pretrain_gpt.py "$@"
Expected behavior
The loss curves of training runs with and without TORCH_NCCL_AVOID_RECORD_STREAMS=1 should be the same.
Stack trace/logs
The following warning message appears when TORCH_NCCL_AVOID_RECORD_STREAMS=1:
3: [rank25]:[W519 13:47:19.144266543 ProcessGroupNCCL.cpp:3648] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 is experimental for point-to-point collectives. To ensure safety, .wait() must be called on all returned handles before they fall out of scope, including for isend() calls. (function operator())
I could not find the corresponding lines in any commit of upstream PyTorch, so I assume this warning is specific to the NVIDIA PyTorch (nvidia-pytorch) build.
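For reference, the handle handling the warning asks for looks roughly like the sketch below. This is my own minimal illustration, not Megatron-LM's actual pipeline-parallel communication code, and it assumes a NCCL process group has already been initialized and that peer is a valid rank.
import torch
import torch.distributed as dist

# Minimal sketch (my own illustration, not Megatron-LM code) of the handle
# handling the warning asks for.
def p2p_exchange(send_buf: torch.Tensor, recv_buf: torch.Tensor, peer: int) -> None:
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    reqs = dist.batch_isend_irecv(ops)
    # With TORCH_NCCL_AVOID_RECORD_STREAMS=1, the warning says every returned
    # handle must be waited on before it falls out of scope -- including the
    # handle for the isend, not only the one for the irecv.
    for req in reqs:
        req.wait()
My (unverified) reading of the warning is that if a handle is dropped without .wait(), the caching allocator may reuse the send/receive buffer while NCCL is still using it, which would corrupt data silently rather than crash, and would be consistent with a wrong loss curve.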
Environment (please complete the following information):
- Megatron-LM commit ID : d580efc (= core_v0.12.0 branch)
- PyTorch version : 2.7.0a0+79aa17489c.nv25.4
- CUDA version : 12.9
- NCCL version : 2.26.3+cuda12.9
- Container : nvidia/pytorch:25.04-py3
Other context
- From the warning message, I speculate that some code path that runs only when virtual pipeline parallelism is enabled mishandles point-to-point collectives (for example, by letting a handle returned by isend/irecv fall out of scope without calling .wait() on it), but I could not pinpoint where that happens. A standalone sketch that exercises only the NCCL point-to-point path, independent of Megatron-LM, is included after this list.
- Upstream PyTorch already seems to have dropped TORCH_NCCL_AVOID_RECORD_STREAMS entirely ( https://github.com/pytorch/pytorch/pull/150398 ). This issue might eventually resolve itself once nvidia-pytorch merges those upstream changes.
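To help isolate whether the warning comes from the container's NCCL point-to-point path rather than from Megatron-LM itself, here is a hypothetical standalone check of my own (p2p_check.py is not part of the runs above). It can be launched on a node with at least 2 GPUs via TORCH_NCCL_AVOID_RECORD_STREAMS=1 torchrun --nproc-per-node=2 p2p_check.py, and I would expect it to print the same ProcessGroupNCCL warning.
$ cat p2p_check.py
# Hypothetical standalone check (not from the original report): a minimal NCCL
# point-to-point exchange that keeps and waits on all returned handles.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun provides LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    peer = (rank + 1) % dist.get_world_size()

    send_buf = torch.full((4,), float(rank), device="cuda")
    recv_buf = torch.empty(4, device="cuda")

    # Point-to-point exchange; keep every returned handle and wait on it
    # explicitly, as the warning requests.
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    torch.cuda.synchronize()

    print(f"rank {rank} received {recv_buf.tolist()} from rank {peer}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()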