torchft
DDP models are different when training is interrupted
Hi folks, not sure if I'm doing anything wrong. I'm seeing a problem where the final models across ranks end up different when training is interrupted.
To reproduce:
Use the following script to launch train_ddp.py across 3 different nodes, 1 GPU per node.
pip install torchft_nightly-2025.7.27-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=180

# Rank 0 hosts the lighthouse locally; the other nodes point at it via MASTER_ADDR.
if [ -z "${RANK}" ] || [ "${RANK}" == "0" ]; then
    RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 10000 --bind 0.0.0.0:${PORT} &
    export TORCHFT_LIGHTHOUSE="http://localhost:${PORT}"
else
    export TORCHFT_LIGHTHOUSE="http://${MASTER_ADDR}:${PORT}"
fi

script_file="train_ddp.py"

cmd=(torchrun
    --nproc_per_node="$SLURM_GPUS_PER_NODE"
    --rdzv_backend c10d
    --rdzv_endpoint="localhost:0"
    "$script_file"
    --
    "$@")

# Print the command
echo "Executing: ${cmd[*]}"

# Execute the command, retrying up to 3 times so a killed torchrun can rejoin
for ((i=1; i<=3; i++)); do
    "${cmd[@]}"
    if [ $? -eq 0 ]; then
        echo "Command succeeded on attempt $i"
        break
    else
        echo "Command failed on attempt $i"
        if [ $i -eq 3 ]; then
            echo "Command failed after 3 attempts"
            exit 1
        fi
        sleep 1  # Optional: wait before retry
    fi
done
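For reference, the train_ddp.py being launched is not shown here; it is assumed to follow the same pattern as the DDP example in the torchft README (Manager, DistributedDataParallel, Optimizer, and ProcessGroupGloo are torchft exports; the toy model, step count, and the min_replica_size=2 value are illustrative only, not the exact script):

import torch
from torch import nn, optim
from torchft import DistributedDataParallel, Manager, Optimizer, ProcessGroupGloo

model = nn.Linear(2, 3)

def load_state_dict(state_dict):
    model.load_state_dict(state_dict)

def state_dict():
    return model.state_dict()

# The Manager owns quorum membership and live recovery of replicas that rejoin.
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=load_state_dict,
    state_dict=state_dict,
    min_replica_size=2,  # assumed to match --min_replicas 2 passed to the lighthouse
)

model = DistributedDataParallel(manager, model)
optimizer = Optimizer(manager, optim.AdamW(model.parameters()))

for step in range(1000):
    batch = torch.rand(2, 2)
    optimizer.zero_grad()   # the wrapper coordinates the start of each step with the Manager
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()        # the step is only committed if the quorum for it succeeded
    print(f"step {step}: loss {loss.item():.6f}")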
In the middle of the training process, SSH into one of the nodes and kill the torchrun process.
Expected result: final loss across all ranks should be the same.
Actual result: final loss on the node that experienced the interruption differs from that on the other two nodes.
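To make "different" concrete, the comparison can be done by having each replica log its final loss together with a cheap parameter fingerprint at the end of training; if the replicas really hold the same model, the fingerprints should match. A hypothetical snippet (param_checksum is not a torchft API; model and loss refer to the training loop sketched above):

import os
import torch
from torch.nn.utils import parameters_to_vector

def param_checksum(model: torch.nn.Module) -> float:
    # Sum of all parameters in float64 as a cheap fingerprint of the weights.
    with torch.no_grad():
        return parameters_to_vector(model.parameters()).double().sum().item()

# Appended after the training loop: every replica prints its final loss and weight fingerprint.
print(f"rank {os.environ.get('RANK', '?')}: "
      f"final loss={loss.item():.6f} param_checksum={param_checksum(model):.6f}")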