[BUG] Problems with MiCS training
Describe the bug
Hi, I'm trying to run GPT model pretraining with the Megatron-DeepSpeed pipeline and the ZeRO-3 + MiCS sharding strategy, but I get the following log:
WARNING: Runtime Error while waiting the collective all-gather, possibly due to the _IllegalWork
[2024-02-02 16:28:29,946] [INFO] [logging.py:96:log_dist] [Rank 0] Error message: Illegal to call wait on IllegalWork object
If the model is split across 2 nodes ("mics_shard_size": 16) and "mics_hierarchical_params_gather": true is set, the error appears explicitly rather than as a warning:
File "/usr/local/lib64/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1650, in __getattribute__
raise RuntimeError(f"Illegal to call {name} on IllegalWork object")
RuntimeError: Illegal to call wait on IllegalWork object
Although training formally continues in the first case, within the first 10-20 iterations the loss scaler catches many overflows and the model stops learning normally. By contrast, pure ZeRO-3 trains without errors, overflows, or any other problems. The error occurs on any number of nodes, even on a single node.
I am using my own fork of the Megatron-DeepSpeed framework with minimal changes to run with MiCS, which unfortunately I cannot share. But I am confident the problem is not in the training code, because all other ZeRO modes work correctly.
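For context, the MiCS-specific change is essentially just constructing the model under deepspeed.zero.MiCS_Init instead of deepspeed.zero.Init, so that parameters are sharded into mics_shard_size groups at construction time. A minimal sketch of that pattern (not my actual code; the Linear layer is a placeholder for the GPT model, and I'm assuming the documented MiCS_Init signature):

```python
import torch
import deepspeed

ds_config = "ds_config.json"  # the config shown below

# MiCS_Init partitions parameters at construction time, like zero.Init,
# but shards them within groups of mics_shard_size ranks.
with deepspeed.zero.MiCS_Init(config_dict_or_path=ds_config):
    model = torch.nn.Linear(4096, 4096)  # placeholder model

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```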
My deepspeed config:
```json
{
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 1536,
  "steps_per_print": 10,
  "zero_optimization": {
    "stage": 3,
    "reduce_scatter": true,
    "overlap_comm": true,
    "allgather_partitions": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8,
    "stage3_max_live_parameters": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_max_reuse_distance": 1e9,
    "stage3_param_persistence_threshold": 1e6,
    "mics_shard_size": 8,
    "mics_hierarchical_params_gather": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 12,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "comms_logger": {
    "enabled": true,
    "verbose": true,
    "prof_all": true,
    "debug": false
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true
}
```
ds_report output
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib64/python3.9/site-packages/torch']
torch version .................... 2.1.0+rocm5.6
deepspeed install path ........... ['/usr/local/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.6.31061-8c743ae5d
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.1, hip 5.6
System info:
- OS: AlmaLinux 9
- 1-24 nodes with 8x MI100 GPUs
- Python version: 3.11
Launcher context
I'm launching my experiment with torchrun.
Can someone suggest a reason for this behavior? Judging by the existing issues, it seems very rare. Is this a problem with the MiCS logic, with my environment, or something else?
@LoggerHead22, we will look into this issue. As an alternative (stopgap measure), please consider using the hpZ component of ZeRO++.
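If it helps, a rough sketch of what that swap looks like in the zero_optimization section above, assuming the zero_hpz_partition_size key from the ZeRO++ tutorial (drop the mics_* keys and keep the rest of the config unchanged; the value 8 simply mirrors the original shard size and is an example, not a recommendation):

```json
{
  "zero_optimization": {
    "stage": 3,
    "reduce_scatter": true,
    "overlap_comm": true,
    "zero_hpz_partition_size": 8
  }
}
```

hpZ keeps a secondary copy of the sharded weights within each node, so parameter all-gathers stay intra-node while gradients are still reduced across all ranks.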
Is there any update on this, @samadejacobs?
@samadejacobs also curious about an update - seeing the same issue with PyTorch 2.2 + CUDA 12 + NVIDIA GPUs.
+1, same issue with PyTorch 2.4, CUDA 12.6, on a p4d.24xlarge.