[BUG] When 'overlap_comm: true' is used with 'contiguous_gradients: true', grad_norm is NaN
Describe the bug
As the title says: when overlap_comm and contiguous_gradients are enabled together, grad_norm becomes NaN (or a constant float value on the latest master with this PR: https://github.com/deepspeedai/DeepSpeed/pull/7171 , which still does not seem to fix the root cause of the NaN). It works fine with 'overlap_comm: false' and 'contiguous_gradients: true', or with 'overlap_comm: true' and 'contiguous_gradients: false'. There seems to be a bug behind contiguous_gradients, maybe a memory-copy conflict?
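For reference, a minimal sketch of the ZeRO settings involved (the stage number, batch size, and dtype here are placeholders, not my exact config):

```python
# Hypothetical DeepSpeed config illustrating the failing flag combination.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,          # grad_norm becomes NaN when combined with the flag below
        "contiguous_gradients": True,  # works fine if either of these two flags is False
    },
}
```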
To Reproduce
Steps to reproduce the behavior:
- With my private dataset I can reproduce it every time.
- I think we can discuss it in this issue; a code review is probably the best way to find the bug, because the failure is strongly tied to the combination of 'overlap_comm' and 'contiguous_gradients' (a generic repro sketch is below).
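Since the dataset is private, here is only a generic sketch of the kind of training loop where I see this, with a toy model and random data; it is not my actual script, and the model, shapes, and config values are assumptions:

```python
# Launch with the deepspeed launcher, e.g.: deepspeed repro.py
import torch
import deepspeed

# Same flag combination as the config sketch above.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2, "overlap_comm": True, "contiguous_gradients": True},
}

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.bfloat16)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()
    # get_global_grad_norm() is where the NaN (or constant value) shows up;
    # it may return None on very early steps depending on the DeepSpeed version.
    print(step, engine.get_global_grad_norm())
```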
Hi @whlook, if you are using an Nvidia GPU, can you run your code with the CUDA sanitizer and share the output, if any?
TORCH_CUDA_SANITIZER=1 python your_code.py
FYI: https://pytorch.org/docs/stable/cuda._sanitizer.html
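If setting the environment variable is inconvenient with your launcher, the sanitizer can also be turned on programmatically at the top of the script (a sketch; it must run before any CUDA work is issued):

```python
# Enable PyTorch's CUDA sanitizer from inside the script instead of via
# TORCH_CUDA_SANITIZER=1; call this before any CUDA operations.
import torch.cuda._sanitizer as csan

csan.enable_cuda_sanitizer()
```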
Hi @whlook, I'm facing a similar issue with AMD GPUs, but it only appears when I use FlashAttention-3. Can you share your environment and FA3 settings? Maybe the memory conflict is related to this?