
[BUG] DeepSpeed ZeRO-2 Offload: backward does not populate IPG buckets even though autograd graph reaches parameters (PyTorch 2.0 / Python 3.8 / DeepSpeed 0.17.6)


Describe the bug
When training with DeepSpeed ZeRO Stage 2 and optimizer offload to CPU, calling engine.backward(loss_) results in empty IPG buckets during gradient reduction (e.g., bucket.buffer: []). This leads to failures during the reduction phase.

Key facts observed during debugging:

- The loss is differentiable and on CUDA: loss_.requires_grad == True and loss_.grad_fn is not None (e.g., <SumBackward0>).
- Before initialization, the model has many trainable parameters (e.g., 1083).
- After DeepSpeed initialization, parameter identity fully matches between engine.module.parameters() and the ZeRO optimizer's bit16_groups (full overlap).
- Autograd graph traversal from loss_ finds many AccumulateGrad nodes (e.g., 63), proving the loss depends on the parameters.
- Despite that, on engine.backward(loss_) the IPG reduction buckets are empty and the step fails.
- This reproduces both on single-process/single-GPU and multi-process/multi-GPU runs.
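One additional check, sketched here as a hypothetical diagnostic rather than something from the report itself, is to register plain PyTorch tensor hooks on the trainable parameters and count how many fire during the engine backward (engine and loss_ refer to the objects built in the reproduction steps below). If the hooks fire while the IPG buckets stay empty, autograd is delivering gradients and the breakage is on DeepSpeed's side of the hook chain:

```python
# Hypothetical diagnostic (not from the report): count how many per-parameter
# gradient hooks fire during the engine backward. Tensor.register_hook is
# plain PyTorch and fires when the gradient w.r.t. that leaf tensor is computed.
fired = []

def make_hook(name):
    def hook(grad):
        fired.append(name)   # record that this parameter received a gradient
        return None          # do not modify the gradient
    return hook

handles = [p.register_hook(make_hook(n))
           for n, p in engine.module.named_parameters() if p.requires_grad]

engine.backward(loss_)  # the same call that leaves the IPG buckets empty

print("[chk] parameter grad hooks fired:", len(fired), "of", len(handles))

for h in handles:
    h.remove()
```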

To Reproduce
The following minimal procedure reliably reproduces the issue.

DeepSpeed config (ZeRO-2 + CPU offload). Using JSON like:

```json
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000
  },
  "wall_clock_breakdown": true
}
```

Initialize the model with DeepSpeed. Before deepspeed.initialize, make all parameters contiguous in-place (important: replace only .data, do not replace nn.Parameter objects), and confirm there are many trainable parameters:

```python
import torch, deepspeed

model = ...  # construct your model with many trainable parameters

with torch.no_grad():
    for p in model.parameters():
        if not p.is_contiguous():
            p.data = p.data.contiguous()

model_params = [p for p in model.parameters() if p.requires_grad]
print("[pre-init] #model_params:", len(model_params))  # e.g., 1083

# Let DeepSpeed create the optimizer from the JSON config; you can either
# pass model_parameters or let DeepSpeed collect them automatically.

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model_params,
    config="/path/to/zero2_offload_cpu.json"
)

model = engine
```

Immediately verify that parameter identity overlaps ZeRO's managed params (use bit16_groups for ZeRO offload):

```python
mp_ids = {id(p) for p in model.module.parameters() if p.requires_grad}

if hasattr(model.optimizer, 'bit16_groups'):
    ds_ids = {id(p) for group in model.optimizer.bit16_groups for p in group}
else:
    ds_ids = {id(p) for g in model.optimizer.param_groups for p in g['params']}

print("[post-init] overlap:", len(mp_ids & ds_ids), "/", len(mp_ids)) # full overlap is observed Build a standard differentiable CUDA loss. Confirm autograd graph reaches parameters by counting AccumulateGrad nodes: python def count_accumulate_grad_nodes(loss): seen, stack, cnt = set(), [getattr(loss, "grad_fn", None)], 0 while stack: fn = stack.pop() if fn is None or fn in seen: continue seen.add(fn) if fn.class.name == "AccumulateGrad": cnt += 1 for nxt, _ in fn.next_functions: if nxt is not None: stack.append(nxt) return cnt

outputs = model(inputs)  # forward using the DeepSpeed engine

loss_ = criterion(outputs, targets)  # standard differentiable loss on CUDA

print("[chk] AccumulateGrad nodes in loss graph:", count_accumulate_grad_nodes(loss_)) # e.g., 63 Call backward via the engine: python model.backward(loss_) # DeepSpeed engine backward Observe the failure in the reduction path mentioning empty IPG buckets, for example: yaml bucket.buffer: [], bucket.index: 0, bucket.elements: 0 Notes:

- Forward always uses model(...) (the engine), not model.module(...).
- No .detach(), .item(), or torch.no_grad() is used in building the loss.
- Reproduces with a single GPU and with multiple processes/GPUs (torchrun or the deepspeed launcher).

Expected behavior
During engine.backward(loss_), ZeRO's gradient hooks should be triggered and gradients should populate the IPG buckets for reduction/communication. No empty-bucket condition should occur when the loss's autograd graph clearly reaches parameters and parameter registration is correct.

ds_report output
N/A. Key environment details are listed below.
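For completeness, a plain PyTorch baseline would confirm the expected behavior outside DeepSpeed. The sketch below is hypothetical and not part of the original report; build_model, inputs, targets, and criterion stand in for the user's own objects:

```python
# Hypothetical baseline: run the same loss on a fresh fp32 model that is NOT
# wrapped by DeepSpeed and confirm plain autograd populates .grad everywhere.
baseline = build_model().cuda().float()   # fresh, unwrapped instance

out = baseline(inputs)
loss_plain = criterion(out, targets)
loss_plain.backward()

trainable = [p for p in baseline.parameters() if p.requires_grad]
populated = sum(p.grad is not None for p in trainable)
print("[baseline] params with .grad populated:", populated, "/", len(trainable))
```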

Screenshots
N/A.

System info
- OS: Linux (x86_64)
- Python: 3.8
- PyTorch: 2.0.0
- DeepSpeed: 0.17.6
- CUDA runtime installed: 11.6
- PyTorch CUDA build: 11.7 (log shows: “Installed CUDA 11.6 does not match the version torch was compiled with 11.7 but APIs are compatible”)
- GPUs: reproduces on single-GPU and multi-GPU setups on different machines
- Distributed: NCCL available and used

Launcher context
Reproduces with both the torchrun and deepspeed launchers (multi-process, multi-GPU) and also single-process, single-GPU.

Docker context
Not using Docker (bare-metal environment).

Additional context
- This is not caused by non-contiguous parameters: parameters are made contiguous in-place via p.data = p.data.contiguous() before initialization; no nn.Parameter objects are replaced.
- This is not a parameter-registration issue: the identity overlap between engine.module.parameters() and ZeRO's bit16_groups is full.
- This is not a disconnected autograd graph: loss_ is on CUDA, is differentiable, has a grad_fn, and graph traversal finds many AccumulateGrad nodes (e.g., 63).
- Despite the above, ZeRO's IPG buckets remain empty during backward. This suggests a bug in the ZeRO-2 offload reduction path under PyTorch 2.0 + DeepSpeed 0.17.6 where gradient hooks or bucket population is not triggered or consumed as expected.

Request: guidance on whether this is a known issue in 0.17.6 with ZeRO-2 + CPU offload (and possibly with overlap_comm/reduce_scatter), and any recommended patches or configuration workarounds; one configuration-bisection idea is sketched below.
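One (untested) way to bisect the configuration, using only keys already present in the config above: switch off the features on the ZeRO-2 reduction path (overlap_comm, reduce_scatter, contiguous_gradients) and drop the CPU offload, then re-enable them one at a time to see which one correlates with the empty buckets. A variant with everything on that path disabled might look like:

```json
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": false,
    "overlap_comm": false,
    "reduce_scatter": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000
  },
  "wall_clock_breakdown": true
}
```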
