[BUG]: multi-node training stuck for open-sora
Is there an existing issue for this bug?
- [x] I have searched the existing issues
The bug has not been fixed in the latest main branch
- [x] I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
I use k8s as the scheduler and cannot obtain the node IP files (hostfile), so I run multi-node training with torchrun like this:
torchrun --nnodes 4 --nproc_per_node 8 --master_addr $MASTER_ADDR --master_port $MASTER_PORT --node-rank $NODE_RANK scripts/diffusion/train.py configs/diffusion/train/demo.py --dataset.data-path modified_data.csv
When I run this command, it gets stuck. Thanks.
Environment
- Base Image: pytorch/pytorch:2.4.0-cuda12.1-cudnn9-devel
git clone https://github.com/hpcaitech/Open-Sora.git
# commit d0cd5ac50da79e9a9d2285a952d4dcd806e6c8fc
cd Open-Sora
pip install -v .
pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn --no-build-isolation
pip install opencv-python-headless
# pip install git+https://github.com/hpcaitech/TensorNVMe.git
git clone https://github.com/hpcaitech/TensorNVMe.git && cd TensorNVMe
pip install -r requirements.txt
pip install -v --no-cache-dir .
Hey there, I was able to reproduce the same "stuck" behavior locally, and it doesn't look like an Open-Sora bug. It's most likely a rendezvous misconfiguration (the matching/meet-up step for the PyTorch DDP processes). In PyTorch DDP, make sure the following are exactly the same across all processes/pods:
- MASTER_ADDR
- MASTER_PORT
- RDZV_ID (a fixed job name)
- The network interface, so all ranks talk over the same NIC
If any one of those differs, workers will sit at the rendezvous barrier and look "stuck". On k8s, you can give the job a stable DNS name and set consistent environment variables in every pod; a headless Service works well:
apiVersion: v1
kind: Service
metadata:
  name: training-headless
spec:
  clusterIP: None  # makes it headless
  selector:
    app: my-trainer
  ports:
    - port: 29500
      name: rendezvous
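To sanity-check the address before launching, you can resolve the Service and probe the port from inside a pod. This is just a quick sketch using the names from the Service above; whether nc is available depends on your base image, and the port will only answer once torchrun is running on the pod that hosts the c10d store:

# Resolve the headless service from inside any trainer pod
getent hosts training-headless.default.svc.cluster.local
# Probe the rendezvous port (only succeeds after rank-0's torchrun is up)
nc -zv training-headless.default.svc.cluster.local 29500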
Then export the same rendezvous environment variables in every pod:
export MASTER_ADDR=training-headless.default.svc.cluster.local
export MASTER_PORT=29500
export RDZV_ID=opensora-job-1234
export GLOO_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
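If it still hangs, these standard PyTorch/NCCL debug variables (a general debugging step, nothing Open-Sora specific) make the rendezvous and NCCL initialization visible in the logs:

export NCCL_DEBUG=INFO                 # print NCCL init and transport selection
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra checks/logging from torch.distributed
export TORCH_CPP_LOG_LEVEL=INFO        # surface c10d rendezvous messages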
Finally, launch with torchrun, using the hyphenated flag spelling consistently:
torchrun \
--nnodes=4 \
--nproc-per-node=8 \
--node-rank=$NODE_RANK \
--rdzv-backend=c10d \
--rdzv-endpoint=${MASTER_ADDR}:${MASTER_PORT} \
--rdzv-id=${RDZV_ID} \
scripts/diffusion/train.py \
configs/diffusion/train/demo.py \
--dataset.data-path modified_data.csv
--node-rank must be unique per pod (0 through 3 for --nnodes=4).
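One way to get a unique rank automatically is to run the trainers as a StatefulSet and derive the rank from the pod ordinal. A minimal sketch, assuming pods named my-trainer-0 … my-trainer-3 (the StatefulSet name is hypothetical):

# Inside a StatefulSet pod, HOSTNAME is my-trainer-<ordinal>; use the ordinal as the node rank
export NODE_RANK=${HOSTNAME##*-}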