[BUG] CUDA error: an illegal memory access was encountered with Adam optimizer on H100

Describe the bug On H100 SXM5, running the FusedAdam optimizer kernel standalone raises CUDA error: an illegal memory access was encountered for certain tensor sizes, such as 2359332864 elements. The GPU has 80 GB of memory, while 2359332864 elements would use at most about 35 GB, so this is not simply an out-of-memory problem.

To Reproduce

import torch
from deepspeed.ops.adam import FusedAdam

# A single fp32 parameter with 2,359,332,864 elements and a zero gradient.
t = torch.zeros(2359332864, dtype=torch.float, device='cuda')
t.grad = torch.zeros_like(t)
params = [t]
optimizer = FusedAdam(params)
optimizer.step()
torch.cuda.synchronize()
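
For reference, 2359332864 is larger than 2**31 - 1 = 2147483647, so the element count does not fit in a signed 32-bit integer. If the fused kernel computes element indices in 32-bit ints, that overflow would explain the illegal access; this is an inference consistent with the PyTorch fix linked later in the thread, not something confirmed here. A quick way to flag such parameters before handing them to FusedAdam (hypothetical helper, not part of DeepSpeed):

INT32_MAX = 2**31 - 1  # 2147483647

# The failing size exceeds the signed 32-bit range.
print(2359332864 > INT32_MAX)  # True

# Hypothetical guard: warn about parameters whose element count overflows int32.
def warn_oversized_params(params):
    for p in params:
        if p.numel() > INT32_MAX:
            print(f"parameter with {p.numel()} elements exceeds the int32 range")

warn_oversized_params(params)  # using the `params` list from the repro above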

Expected behavior

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
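
As the message suggests, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the failing kernel shows up in the traceback. One way (a minimal sketch, not from the original report) is to set it at the top of the repro script, before CUDA is initialized:

import os

# Must be set before the first CUDA call creates a context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from deepspeed.ops.adam import FusedAdam
# ... then run the repro above unchanged.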

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.9.2+a094c976, a094c976, master
torch cuda version ............... 12.0
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 1.14, cuda 12.0

System info (please complete the following information):

  • OS: Ubuntu 20.04.6 LTS
  • GPU count and types: A single machine with 8 H100 SXM5
  • Interconnects: NVSwitch
  • Python version: Python 3.8.10

Launcher context No need to use launcher

Docker context nvcr.io/nvidia/pytorch:23.02-py3

szhengac avatar May 02 '23 23:05 szhengac

Hi @szhengac, I was able to reproduce this issue with both DeepSpeed FusedAdam and Apex FusedAdam. I see that you opened an issue there as well. We'll be sure to update the DeepSpeed version of FusedAdam if a solution is found.

jomayeri avatar May 04 '23 20:05 jomayeri

https://github.com/NVIDIA/apex/issues/1654

jomayeri avatar May 04 '23 20:05 jomayeri

Closing for now. Follow the link to the Apex issue page for further resolution.

jomayeri avatar May 15 '23 17:05 jomayeri

Does DeepSpeed FusedAdam have the same fix as torch https://github.com/pytorch/pytorch/pull/101760? I recently ran into this issue and had to switch to CPU Adam.

chiragjn avatar Dec 16 '23 07:12 chiragjn
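
For anyone hitting this before an upstream fix lands, the switch to CPU Adam mentioned in the last comment looks roughly like the following; a minimal sketch, assuming the cpu_adam op is built (as in the ds_report above) and the host has enough RAM for the parameter, gradient, and optimizer states:

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Keep the oversized parameter on the host so the CPU Adam kernel performs
# the update instead of the CUDA FusedAdam kernel.
t = torch.zeros(2359332864, dtype=torch.float)
t.grad = torch.zeros_like(t)
optimizer = DeepSpeedCPUAdam([t])
optimizer.step()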