
[BUG]: multi-node training stuck for open-sora

Open ltm920716 opened this issue 7 months ago • 1 comment

Is there an existing issue for this bug?

  • [x] I have searched the existing issues

The bug has not been fixed in the latest main branch

  • [x] I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

I use k8s as the scheduling tool and cannot get the IP files, so I run multi-node training with torchrun like this:

torchrun --nnodes 4 --nproc_per_node 8 --master_addr $MASTER_ADDR --master_port $MASTER_PORT --node-rank $NODE_RANK scripts/diffusion/train.py configs/diffusion/train/demo.py --dataset.data-path modified_data.csv

I run the command and it gets stuck. Thanks.

Environment

  • Base Image: pytorch/pytorch:2.4.0-cuda12.1-cudnn9-devel
git clone https://github.com/hpcaitech/Open-Sora.git
# commit d0cd5ac50da79e9a9d2285a952d4dcd806e6c8fc
cd Open-Sora

pip install -v .
pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn --no-build-isolation

pip install opencv-python-headless
# pip install git+https://github.com/hpcaitech/TensorNVMe.git
git clone https://github.com/hpcaitech/TensorNVMe.git && cd TensorNVMe
pip install -r requirements.txt
pip install -v --no-cache-dir .

ltm920716 · Jun 16 '25 11:06

Hey there, I was able to reproduce the same "stuck" behavior locally, and it doesn't look like an Open-Sora bug. It's most likely a rendezvous misconfiguration (the matching/meet-up step for the PyTorch DDP processes). In PyTorch DDP, make sure the following are exactly the same across all processes/pods:

  1. MASTER_ADDR
  2. MASTER_PORT
  3. RDZV_ID (a fixed job name)
  4. The network interface, so all ranks talk over the same NIC

If any one of those differs, the workers will sit at the rendezvous barrier and look "stuck". On k8s you can give the job a stable DNS name and a consistent environment in every pod; a headless Service works well:

apiVersion: v1
kind: Service
metadata:
  name: training-headless
spec:
  clusterIP: None         # makes it headless
  selector:
    app: my-trainer       # must match the label on your training pods
  ports:
    - port: 29500
      name: rendezvous
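
To confirm that name actually resolves in the cluster, a quick lookup from any training pod is enough (the hostname below just follows from the example Service above, assuming the default namespace):

# run inside any training pod; for a headless Service, DNS returns the backing pod IPs
getent hosts training-headless.default.svc.cluster.local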

and use the same rendezvous env in every pod:

export MASTER_ADDR=training-headless.default.svc.cluster.local
export MASTER_PORT=29500
export RDZV_ID=opensora-job-1234
export GLOO_SOCKET_IFNAME=eth0     # pin rendezvous/Gloo traffic to the pod NIC
export NCCL_IB_DISABLE=1           # skip InfiniBand if the cluster has no IB fabric
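
While you are setting those, it also helps to turn on PyTorch's distributed debug logging so you can see where a rank gets stuck; these are generic torch/NCCL environment variables, not Open-Sora options, and purely optional:

# optional debugging knobs, exported in every pod before torchrun
export TORCH_CPP_LOG_LEVEL=INFO        # surface c10d rendezvous/store logs
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # report per-rank collective mismatches
export NCCL_DEBUG=INFO                 # show which NIC/transport NCCL picks
export NCCL_DEBUG_SUBSYS=INIT,NET      # limit NCCL output to init and networking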

Then launch torchrun using the hyphenated flag spellings:

torchrun \
  --nnodes=4 \
  --nproc-per-node=8 \
  --node-rank=$NODE_RANK \
  --rdzv-backend=c10d \
  --rdzv-endpoint=${MASTER_ADDR}:${MASTER_PORT} \
  --rdzv-id=${RDZV_ID} \
  scripts/diffusion/train.py \
    configs/diffusion/train/demo.py \
    --dataset.data-path modified_data.csv

--node-rank must be unique for each pod (0 through nnodes - 1); one way to derive it on k8s is sketched below.
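
One way to get that unique rank automatically is to let Kubernetes assign it; the sketch below assumes the pods come from a batch/v1 Indexed Job (completionMode: Indexed), and the names are illustrative rather than taken from this report:

# Indexed Jobs expose the pod's index as the JOB_COMPLETION_INDEX env var
export NODE_RANK=${JOB_COMPLETION_INDEX}
# With a StatefulSet instead, the ordinal at the end of the pod hostname
# (e.g. trainer-2) can serve the same purpose:
# export NODE_RANK=${HOSTNAME##*-}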

blondon1 · Jul 30 '25 15:07