conda-forge.github.io

how to match conda-forge CUDA to the local system when working on compute clusters with modules

Open beckermr opened this issue 6 months ago • 12 comments

My colleagues and I do a lot of work at supercomputing centers and want to use CUDA-enabled conda-forge packages. Do we have general guidelines for how to ensure the conda-forge env is compatible with the underlying CUDA from the supercomputing centers?

cc @jakirkham or anyone else, also @conda-forge/core

beckermr avatar Jul 08 '25 18:07 beckermr

Given this guide

https://github.com/conda-forge/cuda-feedstock/blob/main/recipe/doc/end_user_run_guide.md

my guess is to match cuda-version to the major.minor cuda version on the system.

Also CC @carterbox who is working on this AFAIK.

beckermr avatar Jul 08 '25 18:07 beckermr

conda-forge CUDA packages are self-contained. You don't need to do module load or anything like that to set up a CUDA environment. As long as all compute nodes have a CUDA 12.x driver installed, you can conda install all needed packages and use them.
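One way to check the driver condition on a compute node is to read the CUDA version that `nvidia-smi` prints in its header. The following is a minimal Python sketch, assuming the header contains a `CUDA Version: X.Y` field. The helper names and the sample header line are illustrative and not taken from conda-forge documentation.

```python
import re
import subprocess


def parse_driver_cuda_version(smi_output):
    """Return (major, minor) from a 'CUDA Version: X.Y' field, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_is_cuda_major(smi_output, major=12):
    """True if the reported CUDA version has the given major version (e.g. 12.x)."""
    version = parse_driver_cuda_version(smi_output)
    return version is not None and version[0] == major


def read_nvidia_smi():
    """Run nvidia-smi on the current node; return "" if it is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return ""
    return result.stdout


# Illustrative header line (an assumption, not captured from a real system).
SAMPLE_HEADER = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
```

On a compute node, `driver_is_cuda_major(read_nvidia_smi())` would report whether the installed driver advertises a 12.x CUDA version. An empty result from `read_nvidia_smi()` suggests the node has no usable driver.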

leofang avatar Jul 08 '25 18:07 leofang

Thanks @leofang. bump @eacharles for viz

beckermr avatar Jul 08 '25 18:07 beckermr

See CONDA_OVERRIDE_CUDA:

https://conda-forge.org/docs/user/tipsandtricks/#installing-cuda-enabled-packages-like-tensorflow-and-pytorch

hmaarrfk avatar Jul 09 '25 03:07 hmaarrfk

CONDA_OVERRIDE_CUDA allows you to pretend you have a different driver version (which is the basis of the __cuda virtual package) from the one installed in the system. It may or may not be what you need.

leofang avatar Jul 09 '25 03:07 leofang

Yes. But in a cluster system, the login nodes often don’t have the same system packages, and could even have a different architecture than the nodes you are running analysis on.

hmaarrfk avatar Jul 09 '25 03:07 hmaarrfk

Or even worse, don't have a physical GPU or the driver installed. So, yes, in this case it makes sense to use the env var to build an environment.
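Building an environment on a login node without a GPU could look roughly like the sketch below: copy the current environment variables, set `CONDA_OVERRIDE_CUDA` to the CUDA version the compute-node drivers support, and run the solver with that environment. The helper functions are hypothetical, and the version, environment name, and package in the usage note are placeholders.

```python
import os
import subprocess


def env_with_cuda_override(cuda_version, base_env=None):
    """Return a copy of the environment with CONDA_OVERRIDE_CUDA set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CONDA_OVERRIDE_CUDA"] = cuda_version
    return env


def create_command(env_name, packages):
    """Build a `conda create` command line that uses the conda-forge channel."""
    return [
        "conda", "create", "--yes",
        "--name", env_name,
        "--channel", "conda-forge",
        *packages,
    ]


def create_env(env_name, packages, cuda_version):
    """Run the solve/install with the override applied to this process only."""
    subprocess.run(
        create_command(env_name, packages),
        env=env_with_cuda_override(cuda_version),
        check=True,
    )
```

For example, `create_env("gpu-env", ["pytorch"], "12.4")` would ask conda to solve as if a CUDA 12.4 capable driver were present. The override is passed only to the child process, so the login shell itself is left unchanged.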

leofang avatar Jul 09 '25 03:07 leofang

The docs I linked to were written in the context of an HPC cluster workflow.

So if they can be improved, I think that would be very, very constructive ^_^

hmaarrfk avatar Jul 09 '25 15:07 hmaarrfk

It would be great if we can add micromamba and pixi to those docs eventually. Also as of the last time I checked, pixi ignored CONDA_OVERRIDE_CUDA if it was set to "" and would forcibly install the CUDA variant even if I don't want it.

This may be outside the scope of what we're discussing here, but it might also be worth mentioning that due to the large size of the CUDA dependencies, users in HPC environments may need to install their CUDA packages in a different directory than where they normally install packages, and just link to the documentation for how to do this. It's significantly easier for micromamba and pixi than for conda but can be a major stumbling block.

danielnachun avatar Jul 09 '25 22:07 danielnachun

> pixi ignored CONDA_OVERRIDE_CUDA if it was set to "" and would forcibly install the CUDA variant even if I don't want it.

Using empty strings for environment variables is a fragile corner case. It's ~possible (but hard) to stringently distinguish between unset and empty values on unix, but on windows: no way (unset == empty).
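The three states a tool would have to tell apart can be made explicit in code. The following is a minimal Python sketch in which `classify_override` is a hypothetical helper; it does not reflect how conda or pixi actually interpret the variable.

```python
import os


def classify_override(environ=None, name="CONDA_OVERRIDE_CUDA"):
    """Classify a variable as ('unset', None), ('empty', None) or ('set', value)."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        # The variable does not exist in the environment at all.
        return ("unset", None)
    if value == "":
        # The variable exists but carries no value.
        return ("empty", None)
    return ("set", value)
```

A tool that gives "unset" and "empty" different meanings depends on the platform and shell preserving that difference all the way to the process, which is why a distinct, non-empty sentinel value is less fragile.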

h-vetinari avatar Jul 09 '25 22:07 h-vetinari

That's a good point. In that case, I think supporting something like "None" as a value would be great.

danielnachun avatar Jul 09 '25 23:07 danielnachun

There's a draft CEP that I need to finish, clarifying all these cases.

jaimergp avatar Jul 10 '25 06:07 jaimergp