[BUG][Deepcompile] reduce_grad returns undefined tensor -> Inductor compilation fails (expected a proper tensor but got None)
Describe the bug

During AOTAutograd backward compilation, DeepSpeed’s reduce_grad op returns an undefined tensor, but the graph rewrite pass rewires all downstream gradient usages to this output. As a result, Inductor/FakeTensor sees None as the input to ops like aten.sum or reshape, causing compilation failure.
Error

```
torch._inductor.exc.InductorError: RuntimeError:
Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self'
```
Trigger path
- Backward graph: each parameter-grad node is rewritten to torch.ops.dc.reduce_grad.default(grad).
- All uses of the original grad are replaced by the output of this op.
- The FX trace shows downstream ops (e.g., aten.sum(..., [0, 1]), reshape) consuming the output of reduce_grad.
- The C++ implementation returns at::Tensor() (undefined) in both:
  - reduce_grad()
  - reduce_grad_meta()

This breaks FakeTensor propagation and Inductor lowering; a minimal sketch of the failure mode follows this list.
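For illustration only, here is a minimal Python sketch that mimics what the trace ends up seeing. The function `reduce_grad_like_current` is a hypothetical stand-in for `torch.ops.dc.reduce_grad.default`, not DeepSpeed's actual op:

```python
import torch

# Hypothetical stand-in for torch.ops.dc.reduce_grad.default: like the current
# C++ and meta kernels, it returns an undefined tensor (None at the Python
# level), while the reduction itself would happen only as a side effect.
def reduce_grad_like_current(grad: torch.Tensor):
    # ... the real kernel would enqueue the gradient reduction here ...
    return None  # mirrors `return at::Tensor();`

grad_w = torch.randn(4, 8)

# After the rewrite pass, downstream ops consume the op's output instead of grad_w:
reduced = reduce_grad_like_current(grad_w)
try:
    torch.sum(reduced, dim=[0, 1])  # mirrors aten.sum(..., [0, 1]) in the trace
except TypeError as err:
    print("downstream op fails:", err)  # None is not a valid Tensor input
```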
Root Cause

reduce_grad is treated as a functional node in the graph, but its C++ kernel and meta kernel return an undefined tensor, which cannot be consumed by downstream ops.
Since the compiler rewrites all gradient uses to point at this output, the output must be a valid Tensor.
Question for maintainers
In DeepSpeed/csrc/compile/deepcompile.cpp, both reduce_grad(...) and reduce_grad_meta(...) currently return an undefined tensor (at::Tensor()).
Given that the graph rewrite redirects all downstream gradient uses to the output of this op, should these two functions instead return the input grad_tensor?
This would allow downstream ops (e.g., aten.sum, reshape) to receive a valid tensor and avoid FakeTensor/Inductor errors during compilation. Is returning grad_tensor the correct fix here, or are the intended semantics different? (A hedged sketch of that behavior is included below.)
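To make the proposal concrete, here is a Python-level sketch of the semantics the change would imply. The `demo` namespace and the registration below are placeholders, not DeepSpeed's actual C++ registration in deepcompile.cpp; the point is only that both the real kernel and the meta kernel produce a well-defined tensor that downstream ops can consume.

```python
import torch

# Placeholder op in a hypothetical "demo" namespace.
lib = torch.library.Library("demo", "DEF")
lib.define("reduce_grad(Tensor grad) -> Tensor")

def reduce_grad_impl(grad: torch.Tensor) -> torch.Tensor:
    # The real kernel would enqueue the gradient reduction here; returning the
    # input mirrors `return grad_tensor;` instead of `return at::Tensor();`.
    # (In a real schema, returning the input unchanged may need an alias annotation.)
    return grad

def reduce_grad_meta(grad: torch.Tensor) -> torch.Tensor:
    # The meta kernel only needs to describe shape/dtype so that FakeTensor
    # propagation and Inductor lowering have a concrete output to work with.
    return torch.empty_like(grad)

lib.impl("reduce_grad", reduce_grad_impl, "CompositeExplicitAutograd")
lib.impl("reduce_grad", reduce_grad_meta, "Meta")

g = torch.randn(4, 8)
out = torch.ops.demo.reduce_grad(g)
print(out.sum(dim=[0, 1]))  # downstream ops now receive a valid tensor
```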
Thanks for reporting! May I have a minimal reproducing script, especially the model structure that triggers the issue?
AFAIK, gradients are used solely to update weights in the optimizer, and the optimizer has its own way of locating the gradient for each parameter tensor. But per your description, your backward graph performs further calculations on the gradients. A minimal model that reproduces the issue would help us understand the problem better.
While it is possible to return the tensor that will (eventually) hold the reduced gradients, the communication may complete long after reduce_grad returns. Additional synchronization would be needed before any further calculations on those gradients.
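A hedged sketch of that synchronization concern, using plain torch.distributed rather than DeepCompile's actual machinery: if the reduction is launched asynchronously, the caller must wait on the communication handle before computing on the gradient.

```python
import os
import torch
import torch.distributed as dist

def reduce_grad_then_use(grad: torch.Tensor) -> torch.Tensor:
    # Launch the reduction asynchronously; `grad` is not safe to read yet.
    handle = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
    # ... other work could overlap with the communication here ...
    handle.wait()  # synchronization required before any further use of `grad`
    return grad.sum(dim=[0, 1])

if __name__ == "__main__":
    # Single-process gloo group, only to make the sketch runnable as-is.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    print(reduce_grad_then_use(torch.randn(4, 8)))
    dist.destroy_process_group()
```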