easybuild-easyblocks
Filter out unsupported compute capabilities from environment when loading CUDA?
When building older CUDA-enabled software with a newer configuration (to support newer hardware), one can run into a situation where cuda_compute_capabilities contains values that the older software does not support, which breaks the build. One example is building Clang/11.0.1-gcccuda-2020b with cuda-compute-capabilities=6.1,7.0,7.5,8.6. Here clang complains that the CUDA in question (correctly) has no idea what sm86 is, which kills the build. Building with eb Clang-11.0.1-gcccuda-2020b.eb --cuda-compute-capabilities=6.1,7.0,7.5, however, works fine.
Two things about this problem:
- It may not be readily apparent to most users why clang fails with errors pointing to sm86.
- It is a generic problem with CUDA not gracefully handling unknown compute capabilities; it is not inherently a clang issue.
One suggestion would be to filter unsupported values out of cuda_compute_capabilities once CUDA is loaded. This would ensure that a given configuration is compatible with both older and newer CUDA versions, while also gracefully handling unsupported compute capabilities.
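To make the suggestion concrete, here is a minimal sketch of the filtering step. The function name `filter_compute_capabilities` and the `supported` list are hypothetical and only for illustration; this is not existing EasyBuild code, and where the supported list comes from is left open here.

```python
def filter_compute_capabilities(requested, supported):
    """Split the requested compute capabilities into supported and unsupported ones.

    Both arguments are lists of strings such as '7.5'. The order of the
    requested list is preserved in both results.
    """
    supported_set = set(supported)
    kept = [cc for cc in requested if cc in supported_set]
    dropped = [cc for cc in requested if cc not in supported_set]
    return kept, dropped


# Values from the example above; the 'supported' list is an assumption
# standing in for whatever the loaded CUDA actually reports.
requested = ['6.1', '7.0', '7.5', '8.6']
supported = ['6.0', '6.1', '7.0', '7.5']

kept, dropped = filter_compute_capabilities(requested, supported)
print("using:", ','.join(kept))
print("ignoring (not supported by this CUDA):", ','.join(dropped))
```

Returning the dropped values as well makes it possible to print a warning, so the user can see why a capability from the configuration was not used.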
@smoors found the following way to check at runtime which compute capabilities a given CUDA installation supports:
$ ml CUDA/11.4.1
$ nvcc --list-gpu-arch
compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
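A listing like this could be turned into the X.Y notation used by cuda_compute_capabilities roughly as follows. This is only a sketch: `parse_gpu_archs` and `list_supported_capabilities` are hypothetical names, and it assumes that every relevant output line has the form compute_<digits>, with the last digit being the minor version.

```python
import re
import subprocess

# Matches lines such as "compute_86": the last digit is taken as the minor version.
ARCH_REGEX = re.compile(r'^compute_(\d+)(\d)$')


def parse_gpu_archs(nvcc_output):
    """Convert the output of 'nvcc --list-gpu-arch' into a list like ['3.5', '8.6']."""
    caps = []
    for line in nvcc_output.splitlines():
        match = ARCH_REGEX.match(line.strip())
        if match:
            caps.append('%s.%s' % match.groups())
    return caps


def list_supported_capabilities(nvcc='nvcc'):
    """Run nvcc (assumed to be in $PATH after loading CUDA) and parse its output.

    check=True raises CalledProcessError if nvcc exits with a non-zero status,
    which is what one would expect when the option is not known to nvcc.
    """
    res = subprocess.run([nvcc, '--list-gpu-arch'],
                         capture_output=True, text=True, check=True)
    return parse_gpu_archs(res.stdout)


# Parsing a few lines of the listing above, without needing nvcc:
sample = "compute_35\ncompute_37\ncompute_86\n"
print(parse_gpu_archs(sample))  # ['3.5', '3.7', '8.6']
```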
However...
$ module load CUDA/10.1.105-GCC-8.2.0-2.31.1; nvcc --list-gpu-arch
nvcc fatal : Unknown option '-list-gpu-arch'
$ module purge; module load CUDA/11.1.1-GCC-10.2.0; nvcc --list-gpu-arch
compute_35
compute_37
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
Hrmpf. Sad panda.
In terms of maintaining the code, it's not that bad if we hardcode the capabilities for CUDA versions below 11 and use nvcc --list-gpu-arch going forward.
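Roughly, that split could look like the sketch below. Everything in it is an assumption for illustration: the names `HARDCODED_CAPABILITIES` and `supported_capabilities` are made up, `query` stands for whatever function ends up wrapping nvcc --list-gpu-arch, and the hardcoded entry is a placeholder that would need to be checked against the NVIDIA documentation before use.

```python
# PLACEHOLDER data: to be verified against the NVIDIA documentation,
# one entry per CUDA <major>.<minor> version below 11 that we care about.
HARDCODED_CAPABILITIES = {
    '10.1': ['3.0', '3.2', '3.5', '3.7', '5.0', '5.2', '5.3',
             '6.0', '6.1', '6.2', '7.0', '7.2', '7.5'],
}


def supported_capabilities(cuda_version, query):
    """Return the compute capabilities supported by the given CUDA version.

    cuda_version: version string such as '10.1.105' or '11.4.1'
    query: callable without arguments that asks nvcc at runtime
           (hypothetical, e.g. a wrapper around 'nvcc --list-gpu-arch')
    """
    parts = cuda_version.split('.')
    if int(parts[0]) >= 11:
        # per this thread, --list-gpu-arch is usable from CUDA 11 onwards
        return query()

    major_minor = '.'.join(parts[:2])
    if major_minor not in HARDCODED_CAPABILITIES:
        raise ValueError("No hardcoded compute capabilities for CUDA %s" % cuda_version)
    return HARDCODED_CAPABILITIES[major_minor]


# The query function is only called for CUDA >= 11:
print(supported_capabilities('11.4.1', query=lambda: ['8.0', '8.6']))
print(supported_capabilities('10.1.105', query=lambda: []))
```

Passing the query in as a callable keeps the version check easy to test without a CUDA installation at hand.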
Yeah, we can dump this into the CUDA easyblock and such, I just wish we didn't have to. :-p
(Also, why does the option error on CUDA <11 not quote the actual option I listed, but instead strip off the first dash? That is not at all confusing.)
Maybe there is a way of parsing the table at https://gist.github.com/ax3l/9489132 - at least to pre-generate the information for the easyblock (saves having to look it up manually).