
fix: resolve multi-node training hanging in Kubernetes environments


Description

Addresses issue #6349 where multi-node training gets stuck during distributed initialization when using torchrun in Kubernetes.

Root Cause

  • Missing rendezvous backend configuration in torchrun
  • No master node readiness checks in K8s pod startup
  • Insufficient timeout configuration for container networking
  • Lack of Kubernetes-specific networking setup

Solution

Enhanced Initialization (colossalai/initialize.py)

  • Add master node readiness checks for non-master ranks
  • Implement configurable timeouts via environment variables
  • Provide detailed error messages with troubleshooting guidance
  • Add robust error handling for distributed process group init (a minimal sketch of these changes follows this list)
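
A minimal sketch of how the readiness check and configurable timeouts could fit together; the helper name _wait_for_master and the COLOSSALAI_MASTER_WAIT_TIMEOUT / COLOSSALAI_DIST_TIMEOUT variable names are illustrative and not necessarily the ones used in the patch:

import os
import socket
import time
from datetime import timedelta

import torch.distributed as dist

def _wait_for_master(host: str, port: int, timeout: float) -> None:
    """Block non-master ranks until the master's rendezvous port accepts connections."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return
        except OSError:
            time.sleep(2)
    raise RuntimeError(
        f"Master {host}:{port} unreachable after {timeout}s. Check that the rank-0 pod "
        "is Running and that the headless service resolves inside the cluster."
    )

def init_distributed(rank: int, world_size: int) -> None:
    host = os.environ["MASTER_ADDR"]
    port = int(os.environ["MASTER_PORT"])
    # Illustrative variable names; the PR exposes these knobs through environment variables.
    wait_timeout = float(os.environ.get("COLOSSALAI_MASTER_WAIT_TIMEOUT", "600"))
    init_timeout = int(os.environ.get("COLOSSALAI_DIST_TIMEOUT", "1800"))

    if rank != 0:
        _wait_for_master(host, port, wait_timeout)

    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=init_timeout),
    )

With this shape, non-master ranks fail fast with an actionable message instead of hanging indefinitely at rendezvous.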

Kubernetes Utilities (colossalai/utils/k8s_distributed.py)

  • Environment variable validation with helpful errors (see the sketch after this list)
  • Automatic K8s networking configuration (NCCL, Gloo)
  • YAML generation for headless services and training jobs
  • Comprehensive diagnostics and troubleshooting tools
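
A sketch of what the utility module could expose, assuming the helper names validate_k8s_env and configure_k8s_networking (the PR text does not list the exact function names); the NCCL and Gloo variables below are standard PyTorch/NCCL settings:

import os

# Variables torchrun and the process group need inside every pod.
REQUIRED_ENV = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "NODE_RANK")

# Conservative defaults for flat pod networking; eth0 is an assumption, and
# NCCL_IB_DISABLE should be dropped on clusters that provide InfiniBand.
K8S_NETWORK_DEFAULTS = {
    "NCCL_SOCKET_IFNAME": "eth0",
    "GLOO_SOCKET_IFNAME": "eth0",
    "NCCL_IB_DISABLE": "1",
    "NCCL_DEBUG": "WARN",
}

def validate_k8s_env() -> None:
    """Fail fast with an actionable error instead of hanging at rendezvous."""
    missing = [name for name in REQUIRED_ENV if not os.environ.get(name)]
    if missing:
        raise EnvironmentError(
            f"Missing environment variables: {', '.join(missing)}. Set them in the "
            "Job/StatefulSet spec; MASTER_ADDR should resolve to the rank-0 pod "
            "through a headless service."
        )

def configure_k8s_networking() -> None:
    """Apply K8s-friendly NCCL/Gloo defaults without overriding user-provided values."""
    for key, value in K8S_NETWORK_DEFAULTS.items():
        os.environ.setdefault(key, value)

Because the defaults only fill in unset variables, deployments that already export their own NCCL or Gloo settings are left untouched, which keeps the behavior opt-in.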

Documentation & Examples

  • Complete K8s multi-node training guide
  • Minimal 2-node test setup for validation
  • Working example with distributed operations testing
  • Test suite for validation

Usage

Replace the basic torchrun invocation with the enhanced rendezvous configuration:

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=$NODE_RANK \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  --rdzv_id=$JOB_ID --rdzv_conf="timeout=1800,read_timeout=120" \
  scripts/diffusion/train.py
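
Inside the training script launched by torchrun, the new utilities could be invoked before ColossalAI initializes the process group. The import path matches the file added by this PR, but the function names are the illustrative ones from the sketch above, and the launch_from_torch arguments depend on the ColossalAI version in use:

# Sketch of how scripts/diffusion/train.py might wire in the checks.
import colossalai
from colossalai.utils.k8s_distributed import (  # illustrative function names
    configure_k8s_networking,
    validate_k8s_env,
)

def main() -> None:
    validate_k8s_env()           # fail fast if the pod spec is incomplete
    configure_k8s_networking()   # NCCL/Gloo defaults for pod networking
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK; launch_from_torch reads them.
    colossalai.launch_from_torch()
    ...  # build model, booster, dataloaders, and run the training loop

if __name__ == "__main__":
    main()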

Backward Compatibility

-  100% backward compatible - no breaking changes
-  Enhanced error messages guide users to solutions
-  New features opt-in via environment variables

Testing

- Core logic validated with tests
- 2-node test configuration provided
- Unit tests included

Fixes #6349

amyanger · Aug 05 '25 21:08