
[BUG] Problems when training in GH200 architecture

Open leandro-ventimiglia opened this issue 3 months ago • 7 comments

Describe the bug
I'm getting the following error when running DeepSpeed on GH200 nodes:

ValueError: invalid literal for int() with base 10: '90a'

The same code runs smoothly when I try it on an A100 chip.
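For context, the GH200's Hopper GPU is reported with the architecture tag `sm_90a` (while A100 is plain `sm_80`), and the error arises from code that assumes the capability suffix is purely numeric. A minimal sketch of the failure (the arch string is hardcoded here to stand in for what the GPU reports):

```python
# Hypothetical repro: on GH200, torch.cuda.get_arch_list() can include
# 'sm_90a'; casting the '90a' suffix to int raises the reported ValueError.
arch = "sm_90a"
capability = arch.split("_")[1]  # -> '90a'
try:
    int(capability)
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: '90a'
```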

leandro-ventimiglia avatar Sep 24 '25 13:09 leandro-ventimiglia

@leandro-ventimiglia I think this issue is related to PyTorch. Which version were you using?

xylian86 avatar Sep 24 '25 16:09 xylian86

@xylian86 I'm using PyTorch 2.6.0. This is the output of ds_report:

UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
[2025-09-24 19:14:05,492] [WARNING] [real_accelerator.py:209:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

deepspeed_not_implemented [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/lus/lfs1aip2/scratch/s5b/ventilean.s5b/miniforge3/envs/ray/lib/python3.10/site-packages/torch']
torch version .................... 2.6.0+cu126
deepspeed install path ........... ['/lus/lfs1aip2/scratch/s5b/ventilean.s5b/miniforge3/envs/ray/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.17.6+2585881a, 2585881a, master
deepspeed wheel compiled w. ...... torch 2.6
shared memory (/dev/shm) size .... 166.00 GB

leandro-ventimiglia avatar Sep 24 '25 19:09 leandro-ventimiglia

@leandro-ventimiglia You probably hit https://github.com/pytorch/pytorch/issues/144037. Please check if the stack trace matches and if the workaround there works for you.

eternalNight avatar Sep 25 '25 09:09 eternalNight

@eternalNight Yes, this is the issue from the old version of PyTorch.

@leandro-ventimiglia You can export TORCH_CUDA_ARCH_LIST=9.0 to work around it.
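For example (a sketch; the variable must be set in the environment of the process that builds the extensions):

```shell
# Pin the CUDA arch list to plain 9.0 so extension builds don't see
# the 'sm_90a' tag reported on GH200 (workaround sketch).
export TORCH_CUDA_ARCH_LIST=9.0
echo "$TORCH_CUDA_ARCH_LIST"
```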

xylian86 avatar Sep 25 '25 15:09 xylian86

@leandro-ventimiglia - does this resolve the issue and can we close this?

loadams avatar Oct 14 '25 15:10 loadams

Not really, I tried export TORCH_CUDA_ARCH_LIST=9.0 but I'm still having problems.

leandro-ventimiglia avatar Oct 14 '25 17:10 leandro-ventimiglia

@leandro-ventimiglia You may need to pass this environment variable to your script. Alternatively, you can modify the following line in your code:

supported_sm = [int(arch.split('_')[1]) for arch in torch.cuda.get_arch_list() if 'sm' in arch]

to

supported_sm = [int(arch.split('_')[1])
                for arch in torch.cuda.get_arch_list()
                if 'sm_' in arch and not arch.endswith('a')]
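As a quick sanity check of that filter, here is a sketch with a hardcoded list standing in for `torch.cuda.get_arch_list()` on a GH200 (the exact arch strings are illustrative):

```python
# Sketch: the corrected comprehension skips suffixed archs like 'sm_90a'
# and non-sm entries, keeping only numeric SM capabilities.
arch_list = ["sm_80", "sm_90", "sm_90a", "compute_90"]
supported_sm = [int(arch.split("_")[1])
                for arch in arch_list
                if "sm_" in arch and not arch.endswith("a")]
print(supported_sm)  # [80, 90]
```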

xylian86 avatar Oct 14 '25 22:10 xylian86