
DDP models are different when training is interrupted


Hi folks, not sure if I'm doing anything wrong. I'm seeing a problem where the final models across ranks end up different when training is interrupted.

To reproduce:

Use the following script to launch train_ddp.py across 3 different nodes, 1 GPU per node.

pip install torchft_nightly-2025.7.27-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=180

if [ -z "${RANK}" ] || [ "${RANK}" == "0" ]; then
    RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 10000 --bind 0.0.0.0:${PORT} &
    export TORCHFT_LIGHTHOUSE="http://localhost:${PORT}"
else
    export TORCHFT_LIGHTHOUSE="http://${MASTER_ADDR}:${PORT}"
fi

script_file="train_ddp.py"

cmd=(torchrun
    --nproc_per_node="$SLURM_GPUS_PER_NODE"
    --rdzv_backend c10d
    --rdzv_endpoint="localhost:0"
    "$script_file"
    --
    "$@")

# Print the command
echo "Executing: ${cmd[@]}"

# Execute the command
for ((i=1; i<=3; i++))
do
    "${cmd[@]}"
    if [ $? -eq 0 ]; then
        echo "Command succeeded on attempt $i"
        break
    else
        echo "Command failed on attempt $i"
        if [ $i -eq 3 ]; then
            echo "Command failed after 3 attempts"
            exit 1
        fi
        sleep 1 # Optional: wait before retry
    fi
done
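
For context, train_ddp.py is essentially the stock torchft DDP example. A minimal sketch of its core loop is below; it is simplified (toy linear model, synthetic data, Gloo process group instead of the real model/data/NCCL setup), so the exact wiring may differ from the script being run:

```python
# Sketch of the train_ddp.py training loop (simplified; based on the torchft example).
import torch
import torch.nn as nn
import torch.optim as optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

model = nn.Linear(2, 3)


def state_dict():
    return {"model": model.state_dict()}


def load_state_dict(sd):
    model.load_state_dict(sd["model"])


# The manager coordinates quorum with the lighthouse and restores state on recovery.
manager = Manager(
    pg=ProcessGroupGloo(),
    min_replica_size=2,
    load_state_dict=load_state_dict,
    state_dict=state_dict,
)

# torchft's DDP wrapper and managed optimizer handle the fault-tolerant allreduce/commit.
ddp_model = DistributedDataParallel(manager, model)
optimizer = Optimizer(manager, optim.AdamW(model.parameters()))

for step in range(1000):
    batch = torch.rand(8, 2)

    optimizer.zero_grad()
    loss = ddp_model(batch).sum()
    loss.backward()
    optimizer.step()  # the managed optimizer only applies the step if the quorum commits it

print(f"final loss: {loss.item()}")
```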

In the middle of the training process, SSH into one of the nodes and kill the torchrun process.

Expected result: final loss across all ranks should be the same.

Actual result: the final loss on the node that experienced the interruption differs from the loss on the other two nodes.

[Screenshot: final losses reported by the three nodes]
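
One way to confirm the divergence beyond comparing the printed losses would be to checksum the final state dict on each replica. A hypothetical helper along these lines (model_fingerprint is my own name, not a torchft API):

```python
# Hypothetical helper (not part of torchft): fingerprint the final weights so
# replicas can be compared from the logs.
import hashlib

import torch


def model_fingerprint(module: torch.nn.Module) -> str:
    hasher = hashlib.sha256()
    for name, tensor in sorted(module.state_dict().items()):
        hasher.update(name.encode())
        # cast to float32 for simplicity so .numpy() works for all dtypes
        hasher.update(tensor.detach().cpu().float().contiguous().numpy().tobytes())
    return hasher.hexdigest()[:16]


# e.g. at the end of train_ddp.py:
# print(f"replica fingerprint: {model_fingerprint(model)}")
```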

btian, Sep 29 '25