root TMVA fails to link to cudnn

Check duplicate issues.

[ ] Checked for duplicates

Description

It seems that one can configure TMVA into a state where cuDNN is disabled, but the build still tries to link against it:

[ 78%] Linking CXX shared library ../../lib/libTMVA.so
cd /root/build/tmva/tmva && /usr/bin/cmake -E cmake_link_script CMakeFiles/TMVA.dir/link.txt --verbose=1
/usr/bin/c++ -fPIC  -Wno-implicit-fallthrough -Wno-noexcept-type -pipe  -Wshadow -Wall -W -Woverloaded-virtual
[...]
/usr/bin/ld: CMakeFiles/TMVA.dir/src/DNN/Architectures/Cudnn.cu.o: in function `TMVA::DNN::cudnnError(cudnnStatus_t, char const*, int, bool) [clone .part.0] [clone .constprop.1]':
tmpxft_0001274b_00000000-6_Cudnn.cudafe1.cpp:(.text+0x1b): undefined reference to `cudnnGetErrorString'
/usr/bin/ld: CMakeFiles/TMVA.dir/src/DNN/Architectures/Cudnn.cu.o: in function `TMVA::DNN::cudnnError(cudnnStatus_t, char const*, int, bool) [clone .part.0] [clone .constprop.2]':
tmpxft_0001274b_00000000-6_Cudnn.cudafe1.cpp:(.text+0x5b): undefined reference to `cudnnGetErrorString'
[...]

I believe the way to trick it is to set tmva-cudnn=ON. It looks like users were not supposed to touch this flag; instead, they should have set -Dcudnn=On. When the latter is set, the following code runs: https://github.com/root-project/root/blob/45f13f0c6e145b0ddef82bf049a43fbe4870381b/cmake/modules/SearchInstalledSoftware.cmake#L1638-L1654

Because I set tmva-cudnn directly instead, the code above doesn't run, and TMVA fails to link because the location of cuDNN is never discovered. Maybe one of cudnn or tmva-cudnn should be removed, so that a single flag enables or disables it.

Reproducer

On Ubuntu with CUDA and cuDNN installed, I ran:

(ROOT-CI) root@102b09e3cf56:~/build# cmake -Dtmva-gpu=On -Dtesting=On -Dtmva-cudnn=On -Dbuiltin_openui5=Off -Dclad=Off -Dgdml=Off -Dgeom=Off -Dopengl=Off -Droot7=Off -Dspectrum=Off -Droofit=Off -Dvdt=Off ../root

$ apt list --installed | grep cudnn
libcudnn9-cuda-12/unknown,now 9.3.0.75-1 amd64 [installed,upgradable to: 9.5.0.50-1]
libcudnn9-dev-cuda-12/unknown,now 9.3.0.75-1 amd64 [installed,upgradable to: 9.5.0.50-1]

ROOT version

Master

Installation method

Source

Operating system

Ubuntu with cuda-12-6

Oct 18 '24 16:10 hageboeck

Could you be more specific? How does the behaviour show up?

Oct 20 '24 07:10 dpiparo

> Could you be more specific? How does the behaviour show up?

The description now contains many more details. The very short version is: remove either the tmva-cudnn or the cudnn flag, and let's use only a single one for all of ROOT.

Oct 20 '24 10:10 hageboeck

Wait, actually there is no existing tmva-cudnn build option in RootBuildOptions.cmake.... So -Dtmva-cudnn=On in your reproducer is "illegal" as far as I can tell, hence your issue.

Oct 24 '24 13:10 guitargeek

> Wait, actually there is no existing tmva-cudnn build option in RootBuildOptions.cmake.... So -Dtmva-cudnn=On in your reproducer is "illegal" as far as I can tell, hence your issue.

Even if it is, why do we have two variables that have to be kept in sync to do one job? And if you look at the ROOT build options, tmva-cudnn very much looks like a logical extension of tmva-cpu, tmva-gpu, tmva-pymva, ..., so I guess it's understandable that I got confused when I looked into the CMake files to see how to enable it. It seems that my mistake also tricked you. 🙂 And even if it was "illegal", why does ROOT configure correctly and fail only when you build?

That's why I'm proposing to remove cudnn, and only go with tmva-cudnn. Let's make this an official build option, fail fast when it's not supported, and keep it off when it's not needed.
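
A minimal sketch of what such a single, fail-fast option could look like is shown below. It is not ROOT's actual code: the option() call, the variable names CUDNN_LIBRARY and CUDNN_INCLUDE_DIR, and the error messages are assumptions made for illustration, and the real discovery logic in SearchInstalledSoftware.cmake may differ.

```cmake
# Hypothetical sketch only; names and messages are illustrative assumptions.
# One user-facing switch, off by default, so it stays off when not needed.
option(tmva-cudnn "Enable cuDNN support in TMVA" OFF)

if(tmva-cudnn)
  # cuDNN is only useful together with the GPU backend.
  if(NOT tmva-gpu)
    message(FATAL_ERROR "tmva-cudnn=ON requires tmva-gpu=ON")
  endif()

  # Discover cuDNN in the same place where the option is evaluated,
  # so that the flag and the discovery step cannot get out of sync.
  find_library(CUDNN_LIBRARY NAMES cudnn)
  find_path(CUDNN_INCLUDE_DIR NAMES cudnn.h)

  # Fail at configure time instead of at link time.
  if(NOT CUDNN_LIBRARY OR NOT CUDNN_INCLUDE_DIR)
    message(FATAL_ERROR
      "tmva-cudnn=ON, but the cuDNN library or headers were not found")
  endif()
endif()
```

With a layout like this, a configuration such as the one in the reproducer would either find cuDNN and pass its location on to the TMVA target, or stop during the CMake step with a clear message rather than at 78% of the build.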

Oct 24 '24 14:10 hageboeck

Yes, this proposal makes sense to me. As you noticed, the inconsistency tricked me too 😆

Oct 24 '24 15:10 guitargeek