Improve method finding CUDA libraries
Describe your problem
Whilst installing Relion it seems the cmake configuration still uses the deprecated CUDA package to find dependencies. Unfortunately when using the HPC SDK from Nvidia the cmake CUDA package cannot find the dependencies due to the change in file structure. I have hit this issue elsewhere, such as in Gromacs, and fixing it requires moving from find_package(CUDA) to find_package(CUDAToolkit) which was introduced in cmake 3.17 (but actually needs cmake 3.26 to work properly with HPC SDK).
Having spent some time making changes, I think the following branch may be the beginnings of a solution. It could be tidied up, but I would like some comment on the approach and on whether raising the required cmake version is acceptable.
https://github.com/green-br/relion/tree/cudatoolkit_update
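For readers unfamiliar with the change being proposed, the following is a minimal sketch of the find_package(CUDAToolkit) style of lookup, not the contents of the branch above. The project name, the target name fft_demo and the source file main.cpp are hypothetical; CUDA::cudart and CUDA::cufft are imported targets provided by the FindCUDAToolkit module.

```cmake
# Minimal sketch: locate the CUDA libraries through FindCUDAToolkit instead of
# the deprecated FindCUDA module. Names below are illustrative assumptions.
cmake_minimum_required(VERSION 3.17)  # FindCUDAToolkit first shipped in 3.17
project(cudatoolkit_sketch LANGUAGES CXX)

# Defines imported targets such as CUDA::cudart and CUDA::cufft.
find_package(CUDAToolkit REQUIRED)

# main.cpp is a hypothetical host-only source file that calls cuFFT.
add_executable(fft_demo main.cpp)

# Linking the imported targets carries include directories and library paths,
# so variables such as CUDA_cufft_LIBRARY are no longer consulted.
target_link_libraries(fft_demo PRIVATE CUDA::cudart CUDA::cufft)
```

Whether this alone resolves the HPC SDK layout depends on the CMake version in use; the report above states that 3.26 was needed in practice.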
Environment:
- OS: OpenSUSE 15
- MPI runtime: Cray-mpich
- RELION version 4.0.1
- Memory: 900GB
- GPU: GH200
Dataset:
Not a runtime issue.
Job options:
Not a runtime issue.
Error message:
No real error message except for not finding the CUDA libraries, e.g.
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_cufft_LIBRARY (ADVANCED)
Thank you very much for your contribution. Indeed this has been on our TODO list (https://github.com/3dem/relion/issues/1016) for a long time but we were unable to do anything concrete, so your patch is very useful.
I have several questions:
- Is it possible to somehow keep the CUDA variable? I understand this conflicts with the module so it is reasonable to change the internal variable name, but we don't want to change user-facing arguments unless it is absolutely necessary.
- "when using the HPC SDK from Nvidia the cmake CUDA package cannot find the dependencies due to the change in file structure"
  Does FindCUDA fail even when CMP0146 is enabled? This is to understand the urgency of the problem.
- "which was introduced in cmake 3.17 (but actually needs cmake 3.26 to work properly with HPC SDK)."
  I thought it was introduced in 3.10 (as stated in the above CMP0146 page). Dropping <= 3.9 is probably fine but requiring 3.17 or 3.26 might be too strict. Can we make it compatible with both versions by falling back to FindCUDA when CMake is old?
- Did you make sure NVCC compiler flags (e.g. OpenMP) are properly passed? This is critically important; without it, mutex locks in parallelization are disabled and the resulting binary is broken (e.g. https://github.com/3dem/relion/issues/1038).
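To make the last question concrete, here is one possible way to forward the host compiler's OpenMP flags to nvcc. It is a sketch under assumptions rather than RELION's actual build logic: it assumes CUDA is enabled as a first-class CMake language, and the project name, the target gpu_kernels, the file kernels.cu and the variable SKETCH_OMP_FLAGS are all hypothetical.

```cmake
# Sketch: make sure host code inside .cu files is compiled with OpenMP enabled.
cmake_minimum_required(VERSION 3.17)
project(openmp_nvcc_sketch LANGUAGES CXX CUDA)

# Sets OpenMP_CXX_FLAGS and defines the OpenMP::OpenMP_CXX imported target.
find_package(OpenMP REQUIRED COMPONENTS CXX)

# kernels.cu is a hypothetical source mixing kernels and OpenMP host code.
add_library(gpu_kernels STATIC kernels.cu)

# OpenMP_CXX_FLAGS is a single string that may hold several flags,
# so split it into a list before forwarding each flag.
separate_arguments(SKETCH_OMP_FLAGS NATIVE_COMMAND "${OpenMP_CXX_FLAGS}")

# nvcc does not understand the host flag directly; -Xcompiler passes it
# through to the host compiler for CUDA sources only.
foreach(flag IN LISTS SKETCH_OMP_FLAGS)
  target_compile_options(gpu_kernels PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=${flag}>)
endforeach()

# Covers ordinary C++ sources and the OpenMP runtime at link time.
target_link_libraries(gpu_kernels PUBLIC OpenMP::OpenMP_CXX)
```

With the legacy FindCUDA module the equivalent step would go through CUDA_NVCC_FLAGS instead; either way, the compile command for a .cu file should be inspected to confirm the flag actually arrives.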
To answer your questions:
- It may be possible to use CUDA and then set a variable to store the option and then unset CUDA. I have proposed a change in the branch and will test to check it still works.
- If CMP0146 is enabled it still doesn't solve the finding of some of the CUDA libraries. Not sure it helps other than to retire the CUDA package.
- It seems it was deprecated in 3.10 but FindCUDAToolkit seems to have been made available in 3.17, e.g. https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html. It may be possible to wrap logic around the newer bits to keep the older bits; I will take a look if the old behaviour should stay.
- I have just added a possible fix for the OpenMP support; I will have to test it.
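The first and third answers above could be combined roughly as follows. This is only a sketch of the idea, not the code in the branch: the SKETCH_ prefixed names are hypothetical, and the 3.17 threshold is simply the release in which FindCUDAToolkit appeared (the thread reports that the HPC SDK needs 3.26 in practice).

```cmake
# Sketch: keep the user-facing -DCUDA=ON/OFF argument while choosing the
# find module according to the running CMake version.
cmake_minimum_required(VERSION 3.10)
project(cuda_fallback_sketch LANGUAGES CXX)

# Internal switch (hypothetical name) that the rest of the build would test.
option(SKETCH_USE_CUDA "Build with CUDA acceleration" ON)

# If the user passed -DCUDA=..., copy the value into the internal switch and
# drop CUDA from the cache so that the name is free again.
if(DEFINED CUDA)
  set(SKETCH_USE_CUDA "${CUDA}" CACHE BOOL "Build with CUDA acceleration" FORCE)
  unset(CUDA CACHE)
endif()

if(SKETCH_USE_CUDA)
  if(CMAKE_VERSION VERSION_LESS 3.17)
    # Old CMake: fall back to the legacy FindCUDA module and its variables.
    find_package(CUDA REQUIRED)
    set(SKETCH_CUDA_LIBS ${CUDA_LIBRARIES} ${CUDA_CUFFT_LIBRARIES})
    set(SKETCH_CUDA_INCLUDES ${CUDA_INCLUDE_DIRS})
  else()
    # Newer CMake: imported targets carry their own include directories.
    find_package(CUDAToolkit REQUIRED)
    set(SKETCH_CUDA_LIBS CUDA::cudart CUDA::cufft)
    set(SKETCH_CUDA_INCLUDES "")
  endif()
  message(STATUS "CUDA libraries: ${SKETCH_CUDA_LIBS}")
endif()
```

Targets would then link against ${SKETCH_CUDA_LIBS} in both branches. Because the choice is stored in the cache under the internal name, a later re-configure without -DCUDA keeps the earlier setting.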
I am working on this in #1016. Closing this issue.