fix: resolve multi-node training hanging in Kubernetes environments
Description
Addresses issue #6349, where multi-node training hangs during distributed initialization when launched with torchrun in Kubernetes.
Root Cause
- Missing rendezvous backend configuration in torchrun
- No master node readiness checks in K8s pod startup
- Insufficient timeout configuration for container networking
- Lack of Kubernetes-specific networking setup
Solution
Enhanced Initialization (colossalai/initialize.py)
- Add master node readiness checks for non-master ranks
- Implement configurable timeouts via environment variables
- Provide detailed error messages with troubleshooting guidance
- Add robust error handling for distributed process group initialization (sketched below)
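The shape of the change can be illustrated with a minimal sketch. The environment variable names (COLOSSALAI_MASTER_WAIT_TIMEOUT, COLOSSALAI_INIT_TIMEOUT) and the helper functions below are illustrative assumptions, not the exact code in colossalai/initialize.py:

```python
# Illustrative sketch only -- env var names and helpers are assumptions,
# not the exact implementation in colossalai/initialize.py.
import os
import socket
import time
from datetime import timedelta

import torch.distributed as dist


def _wait_for_master(host: str, port: int, timeout: float) -> None:
    """Poll the master's TCP port until it accepts connections or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return
        except OSError:
            time.sleep(2)
    raise RuntimeError(
        f"Master {host}:{port} not reachable after {timeout}s. "
        "Check the headless service DNS name and that the master pod is running."
    )


def init_distributed(rank: int, world_size: int, host: str, port: int, backend: str = "nccl") -> None:
    # Timeouts are configurable via environment variables (names assumed here).
    master_wait = float(os.environ.get("COLOSSALAI_MASTER_WAIT_TIMEOUT", "300"))
    pg_timeout = float(os.environ.get("COLOSSALAI_INIT_TIMEOUT", "1800"))

    # Non-master ranks wait until the master's rendezvous port is open.
    if rank != 0:
        _wait_for_master(host, port, master_wait)

    try:
        dist.init_process_group(
            backend=backend,
            init_method=f"tcp://{host}:{port}",
            rank=rank,
            world_size=world_size,
            timeout=timedelta(seconds=pg_timeout),
        )
    except Exception as exc:
        raise RuntimeError(
            "Distributed init failed. Verify MASTER_ADDR resolves inside the pod, "
            "the port is exposed by the headless service, and all ranks use the same job id."
        ) from exc
```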
Kubernetes Utilities (colossalai/utils/k8s_distributed.py)
- Environment variable validation with helpful errors
- Automatic Kubernetes networking configuration for NCCL and Gloo (see the sketch after this list)
- YAML generation for headless services and training jobs
- Comprehensive diagnostics and troubleshooting tools
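A rough sketch of how the validation and NCCL/Gloo setup could look; the function names, the default interface name, and the chosen defaults are assumptions, not the exact API of colossalai/utils/k8s_distributed.py:

```python
# Illustrative sketch of the K8s helpers -- names and defaults are assumptions.
import os

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def validate_k8s_env() -> None:
    """Fail fast with an actionable message if required variables are missing."""
    missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
    if missing:
        raise EnvironmentError(
            f"Missing environment variables: {', '.join(missing)}. "
            "In Kubernetes these are usually injected via the Job/StatefulSet spec "
            "or derived from the headless service name."
        )


def configure_k8s_networking(interface: str = "eth0") -> None:
    """Set NCCL/Gloo defaults that work with typical pod networking (overlay CNI)."""
    os.environ.setdefault("NCCL_SOCKET_IFNAME", interface)
    os.environ.setdefault("GLOO_SOCKET_IFNAME", interface)
    # Most CNIs do not expose InfiniBand inside pods; disable unless explicitly enabled.
    os.environ.setdefault("NCCL_IB_DISABLE", "1")
    os.environ.setdefault("NCCL_DEBUG", "WARN")
```

Because the sketch uses setdefault, any value already set in the pod spec takes precedence, so the automatic configuration stays opt-out rather than overriding user settings.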
Documentation & Examples
- Complete K8s multi-node training guide
- Minimal 2-node test setup for validation
- Working example with distributed operations testing (illustrated below)
- Test suite for validation
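For the distributed operations example, a minimal smoke test of the kind described might look like the following; this is a sketch run under torchrun, not the exact example shipped with this PR:

```python
# Minimal distributed smoke test (run under torchrun) -- a sketch, not the
# actual example included in this PR.
import os

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    # All-reduce of each rank's id should sum to 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)], device=device)
    dist.all_reduce(t)
    expected = world_size * (world_size - 1) / 2
    assert t.item() == expected, f"all_reduce mismatch: {t.item()} != {expected}"

    if rank == 0:
        print(f"Smoke test passed across {world_size} ranks.")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```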
Usage
Replace the basic torchrun invocation with the enhanced rendezvous configuration:
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=$NODE_RANK \
--rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
--rdzv_id=$JOB_ID --rdzv_conf="timeout=1800,read_timeout=120" \
scripts/diffusion/train.py
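The training script is assumed to pick up the environment that torchrun establishes. A minimal entry point might look roughly like the sketch below; it is not the actual scripts/diffusion/train.py, and the launch_from_torch signature varies across ColossalAI versions:

```python
# Hypothetical minimal entry point -- not the actual scripts/diffusion/train.py.
import colossalai
import torch.distributed as dist


def main() -> None:
    # Reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT injected by torchrun;
    # older ColossalAI releases may require a config argument here.
    colossalai.launch_from_torch()

    if dist.get_rank() == 0:
        print(f"Rendezvous complete: {dist.get_world_size()} ranks joined.")


if __name__ == "__main__":
    main()
```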
Backward Compatibility
- 100% backward compatible; no breaking changes
- Enhanced error messages guide users to solutions
- New features are opt-in via environment variables
Testing
- Tested with logic validation
- 2-node test configuration provided
- Unit tests included
Fixes #6349