easybuild-easyblocks

Filter out unsupported compute capabilities from environment when loading CUDA?

Open terjekv opened this issue 3 years ago • 3 comments

When building older CUDA-enabled software with a newer configuration (to support newer hardware), one can run into a situation where cuda_compute_capabilities contains values unsupported by the older software, which breaks the build. One example is building Clang/11.0.1-gcccuda-2020b with cuda-compute-capabilities=6.1,7.0,7.5,8.6. Here clang complains that the CUDA in question (correctly) has no idea what sm86 is, which kills the build. Building with eb Clang-11.0.1-gcccuda-2020b.eb --cuda-compute-capabilities=6.1,7.0,7.5, however, works fine.

Two things about this problem:

  1. It may not be readily apparent to most users why clang fails with errors pointing to sm86.
  2. It is a generic problem with CUDA not gracefully handling unknown compute capabilities; it is not inherently a clang issue.

One suggestion would be to filter out unsupported cuda_compute_capabilities from the configuration once CUDA is loaded. This would ensure that a given configuration is compatible with both older and newer CUDA versions, while also gracefully handling unsupported compute capabilities.
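A minimal sketch of that filtering step, assuming the set of capabilities supported by the loaded CUDA is already known. The function name filter_capabilities and the supported list below are hypothetical and only illustrate the idea; this is not existing EasyBuild code.

```python
def filter_capabilities(requested, supported):
    """Split requested compute capabilities into (kept, dropped).

    Both arguments are lists of dotted strings such as "7.5".
    The order of the requested list is preserved.
    """
    supported = set(supported)
    kept = [cc for cc in requested if cc in supported]
    dropped = [cc for cc in requested if cc not in supported]
    return kept, dropped


# Configuration value from the example in this issue.
requested = "6.1,7.0,7.5,8.6".split(",")

# Hypothetical supported set for an older CUDA (an assumption for
# illustration, not read from a real installation).
supported = ["6.0", "6.1", "7.0", "7.5"]

kept, dropped = filter_capabilities(requested, supported)
if dropped:
    # Reporting what was dropped addresses point 1: users can see why
    # a capability did not make it into the build.
    print("Ignoring unsupported compute capabilities: " + ",".join(dropped))
print("Building for: " + ",".join(kept))
```

Returning the dropped values separately, rather than silently discarding them, would let the caller log a warning instead of leaving users to decode a compiler error.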

@smoors found the following way to check at runtime which compute capabilities a given CUDA installation supports:

$ ml CUDA/11.4.1
$ nvcc --list-gpu-arch
compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87

However...

$ module load CUDA/10.1.105-GCC-8.2.0-2.31.1; nvcc --list-gpu-arch
nvcc fatal   : Unknown option '-list-gpu-arch'
$ module purge; module load CUDA/11.1.1-GCC-10.2.0; nvcc --list-gpu-arch
compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86

Hrmpf. Sad panda.
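For the versions where the option does exist, the output is easy to turn into the dotted form used by cuda-compute-capabilities. A rough sketch, assuming every relevant line has the form compute_XY with the last digit being the minor version (the helper name parse_gpu_arch_list is made up for this example):

```python
import re


def parse_gpu_arch_list(output):
    """Convert `nvcc --list-gpu-arch` output to dotted capabilities.

    "compute_86" becomes "8.6". Lines that do not match the assumed
    compute_<major><minor> pattern (such as the "nvcc fatal" error
    printed by older CUDA versions) are skipped.
    """
    caps = []
    for line in output.splitlines():
        match = re.fullmatch(r"compute_(\d+)(\d)", line.strip())
        if match:
            caps.append(match.group(1) + "." + match.group(2))
    return caps


# A few lines taken from the CUDA/11.1.1 output above.
sample = "compute_35\ncompute_75\ncompute_80\ncompute_86\n"
print(parse_gpu_arch_list(sample))
```

An empty result (as the CUDA 10.1 error message would give) can then be treated as "unknown" rather than "nothing supported".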

terjekv avatar May 02 '22 21:05 terjekv

Hrmpf. Sad panda.

In terms of maintaining the code, it's not that bad if we hardcode the capabilities for CUDA versions below 11 and use nvcc --list-gpu-arch going forward.
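Something along these lines, as a sketch only. The names supported_archs and nvcc_list_gpu_arch and the legacy_table argument are hypothetical, and the table contents would still have to be filled in from NVIDIA's documentation:

```python
import subprocess


def nvcc_list_gpu_arch():
    """Return the lines printed by `nvcc --list-gpu-arch`.

    Returns None if nvcc is not on the PATH or rejects the option.
    """
    try:
        result = subprocess.run(["nvcc", "--list-gpu-arch"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.split()


def supported_archs(cuda_version, legacy_table, query=nvcc_list_gpu_arch):
    """Pick the source of truth based on the CUDA version.

    cuda_version: version string such as "11.4.1".
    legacy_table: dict mapping "major.minor" to a hardcoded list,
                  meant for CUDA versions below 11.
    Returns None when nothing is known, so the caller can leave the
    configured capabilities untouched.
    """
    parts = cuda_version.split(".")
    if int(parts[0]) >= 11:
        archs = query()
        if archs is not None:
            return archs
    return legacy_table.get(".".join(parts[:2]))
```

Passing the nvcc query in as a parameter keeps the version logic testable without a CUDA installation.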

smoors avatar May 03 '22 06:05 smoors

Yeah, we can dump this into the CUDA easyblock and such, I just wish we didn't have to. :-p

(Also, why does the option error on CUDA <11 not quote the actual option I passed, but instead strip off the first dash? That is not at all confusing.)

terjekv avatar May 03 '22 06:05 terjekv

Maybe there is a way of parsing the table at https://gist.github.com/ax3l/9489132 - at least to pre-generate the information for the easyblock (saves having to look it up manually).

dithwick avatar May 03 '22 14:05 dithwick