[BUG] When 'overlap_comm: true' is used with 'contiguous_gradients: true', grad_norm is NaN
Describe the bug
As the title says: when overlap_comm and contiguous_gradients are enabled together, grad_norm becomes NaN (or a constant float value on the latest master with this PR: https://github.com/deepspeedai/DeepSpeed/pull/7171 , which still does not seem to fix the root cause of the NaN). It works fine with 'overlap_comm: false' and 'contiguous_gradients: true', or with 'overlap_comm: true' and 'contiguous_gradients: false'. There seems to be a bug behind contiguous_gradients, maybe a memory-copy conflict?
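For reference, a minimal sketch of the ZeRO settings involved (the stage number, batch size, and dtype here are placeholders, not my exact config):

```python
# Hypothetical DeepSpeed config illustrating the failing flag combination.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,          # grad_norm becomes NaN when combined with the flag below
        "contiguous_gradients": True,  # works fine if either of these two flags is False
    },
}
```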
To Reproduce
Steps to reproduce the behavior:
- With my private dataset I can reproduce it every time.
- I think we can discuss it in this issue; a code review is probably the best way to find the bug, because the failure is strongly tied to the combination of 'overlap_comm' and 'contiguous_gradients' (a generic repro sketch is below).
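Since the dataset is private, here is only a generic sketch of the kind of training loop where I see this, with a toy model and random data; it is not my actual script, and the model, shapes, and config values are assumptions:

```python
# Launch with the deepspeed launcher, e.g.: deepspeed repro.py
import torch
import deepspeed

# Same flag combination as the config sketch above.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2, "overlap_comm": True, "contiguous_gradients": True},
}

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.bfloat16)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()
    # get_global_grad_norm() is where the NaN (or constant value) shows up;
    # it may return None on very early steps depending on the DeepSpeed version.
    print(step, engine.get_global_grad_norm())
```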
Hi @whlook, if you are using an Nvidia GPU, can you run your code with the CUDA sanitizer and share the output, if any?
TORCH_CUDA_SANITIZER=1 python your_code.py
FYI: https://pytorch.org/docs/stable/cuda._sanitizer.html
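If setting the environment variable is inconvenient with your launcher, the sanitizer can also be turned on programmatically at the top of the script (a sketch; it must run before any CUDA work is issued):

```python
# Enable PyTorch's CUDA sanitizer from inside the script instead of via
# TORCH_CUDA_SANITIZER=1; call this before any CUDA operations.
import torch.cuda._sanitizer as csan

csan.enable_cuda_sanitizer()
```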
Hi @whlook, I'm facing a similar issue with AMD GPUs, but it only appears when I use FlashAttention-3. Can you share your environment and FA3 settings? Maybe the memory conflict is related to this?