How to match conda-forge CUDA to the local system when working on compute clusters with modules
My colleagues and I do a lot of work at supercomputing centers and want to use CUDA-enabled conda-forge packages. Do we have general guidelines for ensuring that the conda-forge env is compatible with the underlying CUDA installation at the supercomputing centers?
cc @jakirkham or anyone else, also @conda-forge/core
Given this guide
https://github.com/conda-forge/cuda-feedstock/blob/main/recipe/doc/end_user_run_guide.md
my guess is to match cuda-version to the major.minor CUDA version on the system.
Also CC @carterbox who is working on this AFAIK.
conda-forge CUDA packages are self-contained. You don't need to do module load or anything like that to set up a CUDA environment. As long as all compute nodes have a CUDA 12.x driver installed, you can conda install all needed packages and use them.
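For anyone who wants to verify the "CUDA 12.x driver" condition on a compute node, here is a minimal sketch. The helper names are hypothetical, and the minimum driver version in `MIN_DRIVER_CUDA_12` is an assumption based on NVIDIA's published Linux compatibility table, so confirm it against the current CUDA release notes before relying on it.

```python
import subprocess

# Assumed minimum Linux driver for CUDA 12.x (check NVIDIA's release notes).
MIN_DRIVER_CUDA_12 = (525, 60, 13)


def parse_driver_version(text):
    """Turn a driver string such as '535.104.05' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports_cuda_12(driver_version, minimum=MIN_DRIVER_CUDA_12):
    """Return True if the driver version is at least the assumed minimum."""
    return parse_driver_version(driver_version) >= minimum


def query_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU.

    Only works on a node that has a GPU and the NVIDIA driver installed.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a compute node, `driver_supports_cuda_12(query_driver_version())` gives a quick yes/no. It needs to run inside a job, since login nodes may not have a driver at all.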
Thanks @leofang. bump @eacharles for viz
See CONDA_OVERRIDE_CUDA:
https://conda-forge.org/docs/user/tipsandtricks/#installing-cuda-enabled-packages-like-tensorflow-and-pytorch
CONDA_OVERRIDE_CUDA allows you to pretend you have a different driver version (which is the basis of __cuda virtual package) from the one installed in the system. It may or may not be what you need.
Yes. But in a cluster system, the login nodes often don’t have the same system packages, and could even have a different architecture than the nodes you are running analysis on.
Or even worse, don't have a physical GPU or the driver installed. So, yes, in this case it makes sense to use the env var to build an environment.
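A rough sketch of that workflow, for building the environment on a login node that has no GPU or driver. The function name, the environment name, and the package list are placeholders for illustration. The only parts taken from the linked docs are the `CONDA_OVERRIDE_CUDA` variable and the conda-forge channel.

```python
import os


def build_override_install(cuda_version, packages, env_name, base_env=None):
    """Return (command, environment) for creating a conda env while
    pretending the machine has a driver supporting `cuda_version`.

    Nothing is executed here; pass the result to subprocess.run yourself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Sets the version reported by the __cuda virtual package for this solve.
    env["CONDA_OVERRIDE_CUDA"] = cuda_version
    cmd = [
        "conda", "create",
        "--name", env_name,
        "--channel", "conda-forge",
        "--yes",
        *packages,
    ]
    return cmd, env
```

Usage would look like `cmd, env = build_override_install("12.4", ["pytorch"], "gpu-env")` followed by `subprocess.run(cmd, env=env, check=True)`. The value passed as `cuda_version` should reflect what the compute-node driver supports, not what the login node reports.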
The docs I linked to were written in the context of an HPC cluster workflow.
So if they can be improved, I think that would be very, very constructive ^_^
It would be great if we can add micromamba and pixi to those docs eventually. Also as of the last time I checked, pixi ignored CONDA_OVERRIDE_CUDA if it was set to "" and would forcibly install the CUDA variant even if I don't want it.
This may be outside the scope of what we're discussing here, but it might also be worth mentioning that due to the large size of the CUDA dependencies, users in HPC environments may need to install their CUDA packages in a different directory than where they normally install packages, and just link to the documentation for how to do this. It's significantly easier for micromamba and pixi than for conda but can be a major stumbling block.
pixi ignored CONDA_OVERRIDE_CUDA if it was set to "" and would forcibly install the CUDA variant even if I don't want it.
Using empty strings for environment variables is a fragile corner case. It's ~possible (but hard) to stringently distinguish between unset and empty values on Unix, but on Windows: no way (unset == empty).
That's a good point, in that case I think maybe supporting something like "None" as a value would be great.
There's a draft CEP I need to finish clarifying all these cases.
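To make the cases above concrete, here is a small sketch of how a tool could classify the variable. This is purely illustrative: the function and the returned labels are made up, and the "none" sentinel is the hypothetical value proposed in this thread, not something conda, micromamba, or pixi is known to implement.

```python
def interpret_cuda_override(environ, name="CONDA_OVERRIDE_CUDA"):
    """Classify the override variable found in a mapping such as os.environ.

    Returns a (mode, version) pair:
      ("detect", None)          variable is unset, so detect the real driver
      ("ambiguous-empty", None) variable is set but empty (the fragile case)
      ("disabled", None)        hypothetical "none" sentinel, i.e. no CUDA
      ("override", "12.4")      explicit version to pretend to have
    """
    raw = environ.get(name)
    if raw is None:
        return ("detect", None)
    if raw == "":
        # On Windows an empty value is indistinguishable from unset,
        # so a tool cannot rely on ever reaching this branch there.
        return ("ambiguous-empty", None)
    if raw.strip().lower() == "none":
        return ("disabled", None)
    return ("override", raw.strip())
```

Passing a plain dict keeps the behavior easy to test; in real use the argument would be `os.environ`. An explicit sentinel sidesteps the unset-versus-empty ambiguity because it is an ordinary non-empty value on every platform.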