[BUG] Problems with MiCS training
Describe the bug
Hi, I'm trying to run GPT model pretraining with the Megatron-DeepSpeed pipeline and the ZeRO-3 + MiCS sharding strategy, but I get the following log:
WARNING: Runtime Error while waiting the collective all-gather, possibly due to the _IllegalWork
[2024-02-02 16:28:29,946] [INFO] [logging.py:96:log_dist] [Rank 0] Error message: Illegal to call wait on IllegalWork object
If the model is split across 2 nodes ("mics_shard_size": 16) and "mics_hierarchical_params_gather": true is set, the error appears explicitly rather than as a warning:
File "/usr/local/lib64/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1650, in __getattribute__
raise RuntimeError(f"Illegal to call {name} on IllegalWork object")
RuntimeError: Illegal to call wait on IllegalWork object
Although training formally continues in the first case, within the first 10-20 iterations the loss scaler catches many overflows and the model stops learning normally. By contrast, pure ZeRO-3 trains without errors, overflows, or any other problems. The error occurs on any number of nodes, even on a single node.
I am using my own fork of the Megatron-DeepSpeed framework with minimal changes to run with MiCS, which unfortunately I cannot share. But I am confident the problem is not in the training code, because all other ZeRO modes work correctly.
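For context, the MiCS-specific change is essentially just constructing the model under deepspeed.zero.MiCS_Init instead of deepspeed.zero.Init, so that parameters are sharded into mics_shard_size groups at construction time. A minimal sketch of that pattern (not my actual code; the Linear layer is a placeholder for the GPT model, and I'm assuming the documented MiCS_Init signature):

```python
import torch
import deepspeed

ds_config = "ds_config.json"  # the config shown below

# MiCS_Init partitions parameters at construction time, like zero.Init,
# but shards them within groups of mics_shard_size ranks.
with deepspeed.zero.MiCS_Init(config_dict_or_path=ds_config):
    model = torch.nn.Linear(4096, 4096)  # placeholder model

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```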
My deepspeed config:
```json
{
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 1536,
  "steps_per_print": 10,
  "zero_optimization": {
    "stage": 3,
    "reduce_scatter": true,
    "overlap_comm": true,
    "allgather_partitions": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8,
    "stage3_max_live_parameters": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_max_reuse_distance": 1e9,
    "stage3_param_persistence_threshold": 1e6,
    "mics_shard_size": 8,
    "mics_hierarchical_params_gather": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 12,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "comms_logger": {
    "enabled": true,
    "verbose": true,
    "prof_all": true,
    "debug": false
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true
}
```
ds_report output
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib64/python3.9/site-packages/torch']
torch version .................... 2.1.0+rocm5.6
deepspeed install path ........... ['/usr/local/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.6.31061-8c743ae5d
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.1, hip 5.6
System info:
- OS: AlmaLinux 9
- 1-24 nodes with 8x MI100 GPUs
- Python version: 3.11
Launcher context
I'm launching my experiment with torchrun.
Can someone suggest a reason for this behavior? Judging by the existing issues, it seems very rare. Is this a problem with the MiCS logic, with my environment, or something else?
@LoggerHead22, we will look into this issue. As an alternative (stopgap measure), please consider using the hpZ component of ZeRO++.
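If it helps, a rough sketch of what that swap looks like in the zero_optimization section above, assuming the zero_hpz_partition_size key from the ZeRO++ tutorial (drop the mics_* keys and keep the rest of the config unchanged; the value 8 simply mirrors the original shard size and is an example, not a recommendation):

```json
{
  "zero_optimization": {
    "stage": 3,
    "reduce_scatter": true,
    "overlap_comm": true,
    "zero_hpz_partition_size": 8
  }
}
```

hpZ keeps a secondary copy of the sharded weights within each node, so parameter all-gathers stay intra-node while gradients are still reduced across all ranks.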
Is there any update on this, @samadejacobs?
@samadejacobs also curious about an update - seeing the same issue with PyTorch 2.2 + CUDA 12 + NVIDIA GPUs.
+1, same issue with PyTorch 2.4, CUDA 12.6, on a p4d.24xlarge.