easybuild-easyblocks Add easyblock for CUDA compatibility libraries

(created using eb --new-pr)

This implements the workflow to install the compatibility libraries from the driver run file: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#manually-installing-from-runfile

ECs using this new EasyBlock: https://github.com/easybuilders/easybuild-easyconfigs/pull/15892

Note that the driver version needs to be compatible with the libraries. I added that as comments to the ECs. I successfully tested the 11.6 EC on PPC with "Driver Version: 440.64.00 CUDA Version: 10.2" in combination with CUDA/11.1.1-GCC-10.2.0.

The following is a minimal example to test that:

main.cu:

#include <cuda_runtime.h>
int main() {
    int deviceCount;
    return cudaGetDeviceCount(&deviceCount) == cudaSuccess ? 0 : 1;
}

nvcc main.cu && ./a.out

Before the compat module is loaded, the CUDA error returned is cudaErrorInitializationError (3). On a machine without GPUs (and/or without the CUDA drivers, I'm not sure which), however, the error is cudaErrorInsufficientDriver (35).

I'm now successfully using this inside a parse_hook:

    if ec.name in ('CUDA', 'CUDAcore') and ec.toolchain.is_system_toolchain():
        if LooseVersion(ec.version) >= '11' and '-ml' in os.environ.get('EASYBUILD_INSTALLPATH', ''):
            ec.log.info("[parse hook] Adding CUDA 11.6 compat package")
            ec['dependencies'].append(('CUDAcompat', '11.6', '', True))

IMO it would make sense to add a sanity check step to the CUDA easyblock that (similar to the TensorFlow easyblock) first checks for nvidia-smi on the machine and, if it exists, compiles and runs the above example program to check that the installed CUDA actually works. I'm not fully sure, though, as nvcc needs a "host compiler" and I don't know whether we really have one at that point already or only later, in fosscuda etc.
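A minimal sketch of what such a check could look like, assuming nvcc can find a host compiler. The helper name `cuda_smoke_test` and its `which`/`run` parameters are hypothetical and not part of the CUDA easyblock or any EasyBuild API; the embedded source mirrors the minimal example above.

```python
import os
import shutil
import subprocess
import tempfile

# Mirrors the minimal example from this comment.
TEST_SOURCE = """\
#include <cuda_runtime.h>
int main() {
    int deviceCount;
    return cudaGetDeviceCount(&deviceCount) == cudaSuccess ? 0 : 1;
}
"""


def cuda_smoke_test(which=shutil.which, run=subprocess.run):
    """Return 'skipped', 'ok' or 'failed' (hypothetical helper, not EasyBuild API)."""
    # Assumption: no nvidia-smi means no driver/GPU, so there is nothing to test.
    if which("nvidia-smi") is None:
        return "skipped"
    nvcc = which("nvcc")
    if nvcc is None:
        return "failed"
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "main.cu")
        exe = os.path.join(tmpdir, "a.out")
        with open(src, "w") as handle:
            handle.write(TEST_SOURCE)
        # Compile first, then run; any non-zero exit code counts as a failure.
        for cmd in ([nvcc, src, "-o", exe], [exe]):
            if run(cmd).returncode != 0:
                return "failed"
    return "ok"
```

The lookup and run functions are injectable only so the logic can be exercised on a machine without CUDA; a real sanity check step would more likely use EasyBuild's own command-running helpers.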

Jul 22 '22 11:07 Flamefire

To me it's really a problem that we can't check that the compat libraries actually work (due to not being able to compile/test a CUDA code). I wonder if we can get around this by using the JIT compiler?

According to https://stackoverflow.com/a/67754251 this might be possible. But if it is too much effort, then we need a big fat warning that additional effort is required to verify that the installed libraries actually work (we could even print the code of the minimal example from the first comment, which they need to compile with CUDA).

Jul 27 '22 10:07 ocaisa

> To me it's really a problem that we can't check that the compat libraries actually work (due to not being able to compile/test a CUDA code). I wonder if we can get around this by using the JIT compiler?

I'd really like to focus on installation here. The compat libraries are very special and won't be used as a dependency of other "official" ECs. So I'd limit this to just getting the files and module in place and doing a sanity check for the common case, which can be skipped for all other cases (e.g. the case of different build and deploy environments). As installation of this is manual, the tests can be manual too, since you already need to know which version to use.

Jul 28 '22 07:07 Flamefire

> I'd really like to focus on installation here. The compat libraries are very special and won't be used as a dependency of other "official" ECs. So I'd limit this to just getting the files and module in place and doing a sanity check for the common case, which can be skipped for all other cases (e.g. the case of different build and deploy environments). As installation of this is manual, the tests can be manual too, since you already need to know which version to use.

Ok, I see your point, but then we have to go the BFW (big fat warning) route and explain to people what to do next to verify that the installation actually works for them, especially since we know there are cases where it will not work (if their current driver is EOL) or may not work (if their driver is not listed in the supported table).

Jul 28 '22 08:07 ocaisa

Updated this with the sanity-check and test for supported driver versions as per nvidia-smi

And yes, this is pretty complicated, as the driver version AND the GPU have to be compatible. So, well... at least the errors will be pretty specific.
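As a rough illustration of such a driver check (not the code in this PR), the sketch below parses the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and compares it against a list of minimum versions. All function names are hypothetical, and the matching rule (same driver branch, at least the listed minimum) is an assumption about how a compatibility table might be interpreted.

```python
import subprocess


def parse_driver_versions(smi_output):
    """Parse the driver query output: one version per GPU, one per line."""
    return sorted({line.strip() for line in smi_output.splitlines() if line.strip()})


def version_tuple(version):
    """Turn '440.64.00' into (440, 64, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_is_supported(driver_version, minimum_versions):
    """Assumed rule: same major branch as a listed entry and not older than it."""
    driver = version_tuple(driver_version)
    for minimum in minimum_versions:
        required = version_tuple(minimum)
        if driver[0] == required[0] and driver >= required:
            return True
    return False


def query_driver_versions():
    """Ask nvidia-smi for the driver version(s); raises if the command fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_driver_versions(result.stdout)
```

For example, `driver_is_supported("440.64.00", ["440.33.01"])` is true under this rule; the minimum `440.33.01` here is a placeholder value chosen for the example, not an entry taken from NVIDIA's table.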

Jul 29 '22 15:07 Flamefire

@Flamefire Some trivial stuff left, I will test this again now

Sep 30 '22 13:09 ocaisa

@Flamefire In https://github.com/easybuilders/easybuild-easyconfigs/pull/15892#issuecomment-1268594845 it sounds like you can say with some certainty that the compat libraries will fail on consumer cards. Is that correct? Would there be a way to recognise a consumer card via nvidia-smi in the easyblock (or maybe the converse, look for GRID or TESLA in nvidia-smi --query-gpu=name --format=csv,noheader)?

Unfortunately I don't have access to different cards to see what the options are here.

Oct 07 '22 10:10 ocaisa

> you can say with some certainty that the compat libraries will fail on consumer cards. Is that correct?

Yes, from the NVIDIA site:

> Forward Compatibility is applicable only for systems with NVIDIA Data Center GPUs or select NGC Server Ready SKUs of RTX cards

I'm not keen on maintaining a list of compatible or incompatible cards in the easyblock though, as "NVIDIA Data Center GPUs" is IMO quite a loose term. So I'd rather leave this module as-is and let site admins deal with that.

IMO we should rather add a test compilation and run to the CUDA easyblock as I proposed earlier. Nothing in the "official" ECs depends on the CUDAcompat libs and it will always be a very site-specific decision to include them somewhere, just like e.g. the --with-slurm option in the OpenMPI ECs is not added by default.

But then again: that has become almost impossible since we dropped fosscuda etc. in favor of a system-level CUDA, as we don't know which compiler can or will be used for the host compilation of a CUDA file.

Anyway some data points on possible names:

Tesla V100-SXM2-32GB
Tesla K80
NVIDIA A100-SXM4-40GB
GeForce GTX 1080 Ti (the only consumer card)
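For illustration only, a naive name-based heuristic over these data points might look like the sketch below. The rule "GeForce in the name means consumer card" is an assumption drawn from the four names above, not from NVIDIA documentation, and it says nothing about the RTX SKUs mentioned in the quote; the function names are hypothetical.

```python
def gpu_names(smi_output):
    """Split the output of 'nvidia-smi --query-gpu=name --format=csv,noheader'."""
    return [line.strip() for line in smi_output.splitlines() if line.strip()]


def looks_like_consumer_card(gpu_name):
    """Assumed heuristic: GeForce-branded names are treated as consumer cards."""
    return "geforce" in gpu_name.lower()


def consumer_cards(smi_output):
    """Return the GPU names that the heuristic flags as consumer cards."""
    return [name for name in gpu_names(smi_output) if looks_like_consumer_card(name)]
```

Applied to the four names listed here, only "GeForce GTX 1080 Ti" is flagged, which is why a site would still have to verify the result for its own hardware.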

Oct 07 '22 11:10 Flamefire
