
[BUG] ZeRO-2 + CPU Offload + overlap_comm=true: the IPG (Independent Partition Gradient) buckets are never populated

XiDianZuoYun opened this issue 3 months ago • 1 comment

Describe the bug

With ZeRO-2 + CPU Offload + overlap_comm=true, the IPG (Independent Partition Gradient) buckets are never populated. During gradient reduction (reduce_ipg_grads), we consistently observe empty buckets (bucket.index=0 and len(bucket.buffer)=0). Earlier we also hit a list index out of range error (now avoided), but the buckets remain empty.

We instrumented both our trainer and the DeepSpeed ZeRO code. After backward, engine.optimizer.ipg_buckets[torch.float32] always shows elements=0 / params_len=0 / grads_len=0, regardless of contiguous_gradients (tried both true and false), with partition_gradients=true and overlap_comm=true.

Forward/loss is normal and loss decreases, so gradients flow somewhere, but they never reach the IPG buckets.
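
For reference, the per-iteration check behind the logs below is roughly the following (a minimal sketch; ipg_buckets and its fields are internal attributes of the ZeRO stage-2 optimizer as observed in 0.17.6 and may change between versions):

def dump_ipg_state(engine):
    # Inspect the ZeRO stage-2 optimizer's IPG buckets right after
    # engine.backward(). The fields read here (ipg_buckets, elements, params,
    # grads, index, buffer) are the internal attributes our instrumentation
    # logs below; they may differ in other DeepSpeed versions.
    opt = engine.optimizer
    for dtype, bucket in opt.ipg_buckets.items():
        print(f"[IPG][trainer] dtype={dtype} elements={bucket.elements} "
              f"params_len={len(bucket.params)} grads_len={len(bucket.grads)} "
              f"idx={bucket.index} buf_len={len(bucket.buffer)}")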

To Reproduce

  • Launcher: torchrun (not deepspeed launcher)
  • Training: multi-task loop. Each iteration has multiple tasks. We run forward per task, and either:
    1. backward per task (engine.backward(loss_task)), with GAS set to the number of tasks in this iteration; or
    2. sum all task losses and backward once, with GAS=1. Both strategies reproduce the empty IPG buckets (strategy 2 is sketched after the pseudo-code below).
  • zero_grad: In DeepSpeed mode, we call engine.zero_grad() once per iteration (outside the per-task loop) to avoid clearing buckets during accumulation.
  • GAS: We set engine.set_gradient_accumulation_steps(num_tasks_in_this_iter) so GAS matches actual backward calls in that iteration.

Pseudo-code:

# each train iter (strategy 1: one backward per task)
engine.zero_grad()                                  # once per iteration, outside the task loop
engine.set_gradient_accumulation_steps(num_tasks)   # GAS = number of backward calls this iteration
for task, batch in data_batch.items():
    batch, loss_dict = model(batch, mode='train', task=task)
    loss = sum(loss_dict.values())                  # total loss for this task
    engine.backward(loss)                           # one backward per task
# later engine.step() is called in a callback
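
For completeness, strategy (2) from the list above (sum all task losses, single backward, GAS=1) is the variant below; it shows the same empty-bucket behavior:

# each train iter, strategy (2): one backward over the summed task losses
engine.zero_grad()
total_loss = 0.0
for task, batch in data_batch.items():
    batch, loss_dict = model(batch, mode='train', task=task)
    total_loss = total_loss + sum(loss_dict.values())
engine.backward(total_loss)  # gradient_accumulation_steps stays at 1
# later engine.step() is called in a callback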

Key logs (from our instrumentation):

comm_type: torch.float32, bucket.index=0, len(bucket.buffer)=0
contig=True: params_len=0, elements=0
[DeepSpeed] bucket.buffer out-of-range: bucket.index=0, len(bucket.buffer)=0
[IPG][trainer] dtype=torch.float32 elements=0 params_len=0 grads_len=0 idx=0 buf_len=0 contig=True part_grads=True overlap=True

We also added prints inside reduce_independent_p_g_buckets_and_remove_grads (enter/swap/copy-before-after/append lengths), but those prints never appear, suggesting the fill path didn’t run or was short-circuited.
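
To rule out misplaced prints on our side, a runtime wrapper like the one below can confirm whether the method is invoked at all during backward (a debugging sketch only; the method lives on the ZeRO stage-2 optimizer and its exact signature is internal, hence the generic passthrough):

import functools

def trace_ipg_reduce(engine):
    # Monkeypatch the ZeRO-2 optimizer so every call to the bucket-fill path
    # is logged; if nothing prints during engine.backward(), the path is
    # never entered rather than entered and short-circuited.
    opt = engine.optimizer
    original = opt.reduce_independent_p_g_buckets_and_remove_grads

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        print("[trace] reduce_independent_p_g_buckets_and_remove_grads called")
        return original(*args, **kwargs)

    opt.reduce_independent_p_g_buckets_and_remove_grads = wrapped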

Expected behavior

  • With ZeRO-2 + CPU offload + overlap_comm=true, reduce_independent_p_g_buckets_and_remove_grads should be triggered during backward hooks and populate the IPG buckets so that bucket.params/grads/elements > 0.
  • reduce_ipg_grads should not encounter empty buffers, or should at least fall back safely instead of repeatedly hitting empty buckets.

ds_report output

We will attach full ds_report output when submitting (currently not available in this environment).

Screenshots

N/A (logs attached above).

System info

  • OS: Linux 5.4.0-125-generic
  • Python: 3.8 (Conda env)
  • DeepSpeed: 0.17.6
  • PyTorch: 2.0.x (launched with torchrun)
  • GPUs / Interconnect: see ds_report

Launcher context

  • torchrun (not the deepspeed launcher)

Docker context

  • Non-official DeepSpeed Docker; using a Conda-based image. We can provide image details if needed.

Configuration

Current DeepSpeed config (also tried variants; issue persists):

{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "adamw",
    "params": { "lr": 1e-4, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.0 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 50000000,
    "overlap_comm": true,
    "contiguous_gradients": false
  },
  "fp16": { "enabled": false },
  "bf16": { "enabled": false },
  "checkpoint_activations": true,
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false,
    "synchronize_checkpoint_boundary": true,
    "number_checkpoints": 0,
    "profile": false
  }
}
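
For context, the engine is created roughly as follows under torchrun (a sketch; build_engine and the variable names are placeholders, and the JSON above is passed via the config argument of deepspeed.initialize):

import json

import deepspeed

def build_engine(model, ds_config_path="ds_config.json"):
    # Placeholder helper: load the JSON config shown above and hand it to
    # deepspeed.initialize; torchrun provides the distributed environment.
    with open(ds_config_path) as f:
        ds_config = json.load(f)
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad],
        config=ds_config,
    )
    return engine

# launched with, e.g.: torchrun --nproc_per_node=<num_gpus> train.py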

We also tried:

  • contiguous_gradients=true (CPU offload prefers it) and false; both result in empty buckets.
  • Different reduce_bucket_size/allgather_bucket_size values.
  • Letting DeepSpeed create the optimizer (DeepSpeedCPUAdam) vs passing in AdamW.
  • fp16/bf16 on/off (our model ultimately needs full FP32, so currently off).
  • Per-task backward with GAS=task_num vs single backward with GAS=1.

Additional context

  • Multi-task training loop. We ensured:
    • No inner-loop zero_grad when using DeepSpeed; only once per iteration.
    • GAS matches the number of backward calls in this iteration.
    • Fixed earlier issues (non-contiguous parameters, ZeRO-Offload with external optimizer warning, CPU/GPU dtype mismatch in inputs).

What we’d like help with:

  • Under ZeRO-2 + CPU offload + overlap_comm=true, under what conditions would reduce_independent_p_g_buckets_and_remove_grads not be triggered, leaving the IPG buckets empty?
  • Are there known incompatibilities with multi-task/multi-microstep backward ordering and IPG?
  • Is using torchrun instead of the deepspeed launcher a factor for IPG trigger paths?
  • Any env flags or debug toggles to verify backward hooks installation and triggers for IPG filling?
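
Related to the last point: we can at least verify with plain autograd hooks that per-parameter gradients are produced during engine.backward() (a minimal sketch using only standard PyTorch APIs; it checks autograd, not DeepSpeed's own reduction hooks):

def _make_probe(name):
    def probe(grad):
        print(f"[probe] grad produced for {name}, norm={grad.norm().item():.4e}")
        return grad
    return probe

def install_grad_probes(model, max_params=5):
    # Register vanilla PyTorch hooks on the first few trainable parameters;
    # if these fire during engine.backward() while the IPG buckets stay
    # empty, the problem is on the DeepSpeed reduction side, not in autograd.
    count = 0
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        param.register_hook(_make_probe(name))
        count += 1
        if count >= max_params:
            break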

We can provide a minimal reproducible script (simplified multi-task iteration + per-task backward + torchrun) and ds_report upon request. Thank you!

XiDianZuoYun · Sep 22 '25

@XiDianZuoYun I'd like to work on this issue. Can you provide a repro script?

therealnaveenkamal · Sep 27 '25