[BUG] Problems when training on GH200 architecture
Describe the bug
I'm getting the following error when running DeepSpeed on GH200 nodes:

```
ValueError: invalid literal for int() with base 10: '90a'
```

The same code runs smoothly when I try it on an A100 chip.
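For reference, the failing conversion can be reproduced in isolation (a minimal sketch; `sm_90a` is the Hopper arch-specific compile target that GH200 builds can report):

```python
# Minimal reproduction of the reported error, independent of DeepSpeed:
# 'sm_90a' is the Hopper arch-specific target seen on GH200 builds, and
# its compute-capability suffix '90a' is not a base-10 integer.
arch = "sm_90a"
try:
    int(arch.split("_")[1])
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: '90a'
```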
@leandro-ventimiglia I think this issue is related to PyTorch. Which version were you using?
@xylian86 I'm using PyTorch 2.6.0. This is the output of `ds_report`:
```
UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
[2025-09-24 19:14:05,492] [WARNING] [real_accelerator.py:209:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/lus/lfs1aip2/scratch/s5b/ventilean.s5b/miniforge3/envs/ray/lib/python3.10/site-packages/torch']
torch version .................... 2.6.0+cu126
deepspeed install path ........... ['/lus/lfs1aip2/scratch/s5b/ventilean.s5b/miniforge3/envs/ray/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.17.6+2585881a, 2585881a, master
deepspeed wheel compiled w. ...... torch 2.6
shared memory (/dev/shm) size .... 166.00 GB
```
@leandro-ventimiglia You probably hit https://github.com/pytorch/pytorch/issues/144037. Please check if the stack trace matches and if the workaround there works for you.
@eternalNight Yes, this is the issue from an older version of PyTorch.
@leandro-ventimiglia You can `export TORCH_CUDA_ARCH_LIST=9.0` to solve it.
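If the shell export does not reach your training processes (e.g. under a cluster launcher), you can set it at the top of the entry script instead. A minimal sketch; the variable just needs to be visible before any CUDA extension is JIT-compiled:

```python
import os

# Set the arch list before torch/deepspeed trigger any JIT builds,
# since TORCH_CUDA_ARCH_LIST is read at extension build time.
os.environ["TORCH_CUDA_ARCH_LIST"] = "9.0"

import torch  # noqa: E402  (imports intentionally after the env var is set)
import deepspeed  # noqa: E402
```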
@leandro-ventimiglia - does this resolve the issue and can we close this?
Not really. I tried `export TORCH_CUDA_ARCH_LIST=9.0`, but I'm still having problems.
@leandro-ventimiglia You may need to pass this environment variable to your script. Alternatively, you can modify the following line in your code:
```python
supported_sm = [int(arch.split('_')[1]) for arch in torch.cuda.get_arch_list() if 'sm_' in arch]
```
to
```python
supported_sm = [int(arch.split('_')[1])
                for arch in torch.cuda.get_arch_list()
                if 'sm_' in arch and not arch.endswith('a')]
```
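If you would rather keep the arch-specific entries such as `sm_90a` instead of dropping them, an alternative sketch (a regex-based parse, not the upstream PyTorch code) strips the trailing letter suffix before the `int()` conversion:

```python
import re
import torch

# Sketch of a more permissive parse: 'sm_90' -> 90 and 'sm_90a' -> 90.
# The leading digits of the compute-capability suffix are kept; the
# trailing arch-specific marker ('a') is ignored rather than rejected.
supported_sm = [
    int(re.match(r"\d+", arch.split("_")[1]).group())
    for arch in torch.cuda.get_arch_list()
    if arch.startswith("sm_")
]
```

Either way, a PyTorch build that includes the fix from the issue linked above should make the workaround unnecessary.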