TMVA fails to link to cudnn
Check duplicate issues.
- [ ] Checked for duplicates
Description
It seems that one can configure tmva into a state where cudnn is disabled, but the build still tries to link against it:
[ 78%] Linking CXX shared library ../../lib/libTMVA.so
cd /root/build/tmva/tmva && /usr/bin/cmake -E cmake_link_script CMakeFiles/TMVA.dir/link.txt --verbose=1
/usr/bin/c++ -fPIC -Wno-implicit-fallthrough -Wno-noexcept-type -pipe -Wshadow -Wall -W -Woverloaded-virtual
[...]
/usr/bin/ld: CMakeFiles/TMVA.dir/src/DNN/Architectures/Cudnn.cu.o: in function `TMVA::DNN::cudnnError(cudnnStatus_t, char const*, int, bool) [clone .part.0] [clone .constprop.1]':
tmpxft_0001274b_00000000-6_Cudnn.cudafe1.cpp:(.text+0x1b): undefined reference to `cudnnGetErrorString'
/usr/bin/ld: CMakeFiles/TMVA.dir/src/DNN/Architectures/Cudnn.cu.o: in function `TMVA::DNN::cudnnError(cudnnStatus_t, char const*, int, bool) [clone .part.0] [clone .constprop.2]':
tmpxft_0001274b_00000000-6_Cudnn.cudafe1.cpp:(.text+0x5b): undefined reference to `cudnnGetErrorString'
[...]
I believe the way to trick it is to set tmva-cudnn=ON. It looks like users were not supposed to touch this variable; instead, they should have set -Dcudnn=On. With the latter set, the following code runs:
https://github.com/root-project/root/blob/45f13f0c6e145b0ddef82bf049a43fbe4870381b/cmake/modules/SearchInstalledSoftware.cmake#L1638-L1654
Instead, I set tmva-cudnn directly, so the code above doesn't run, and tmva fails to link because the location of cudnn is never discovered. Maybe one of cudnn or tmva-cudnn should be removed, so that only a single flag enables or disables it.
Reproducer
On ubuntu with cuda and cudnn, I did:
(ROOT-CI) root@102b09e3cf56:~/build# cmake -Dtmva-gpu=On -Dtesting=On -Dtmva-cudnn=On -Dbuiltin_openui5=Off -Dclad=Off -Dgdml=Off -Dgeom=Off -Dopengl=Off -Droot7=Off -Dspectrum=Off -Droofit=Off -Dvdt=Off ../root
$ apt list --installed | grep cudnn
libcudnn9-cuda-12/unknown,now 9.3.0.75-1 amd64 [installed,upgradable to: 9.5.0.50-1]
libcudnn9-dev-cuda-12/unknown,now 9.3.0.75-1 amd64 [installed,upgradable to: 9.5.0.50-1]
ROOT version
Master
Installation method
Source
Operating system
Ubuntu with cuda-12-6
Additional context
No response
Could you be more specific? How does the behaviour show up?
There are now many more details in the description. The very short version is:
Remove either the tmva-cudnn or the cudnn flag, and let's only use a single one for all of ROOT.
Wait, actually there is no existing tmva-cudnn build option in RootBuildOptions.cmake. So -Dtmva-cudnn=On in your reproducer is "illegal" as far as I can tell, hence your issue.
Even if it is, why do we have two variables which have to be kept in sync to do one job? And if you look at the ROOT build options, tmva-cudnn very much looks like a logical extension of tmva-cpu, tmva-gpu, tmva-pymva, ..., so I guess it's logical that I got confused when I looked into CMake how to enable it. It seems that my mistake also tricked you. 🙂
And even if it was "illegal", why does ROOT configure correctly and fail only when you build?
That's why I'm proposing to remove cudnn, and only go with tmva-cudnn. Let's make this an official build option, fail fast when it's not supported, and keep it off when it's not needed.
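A minimal sketch of what such a fail-fast check could look like at configure time. This is not ROOT's actual code: it assumes a plain `option()` declaration (ROOT declares its options through its own macro in RootBuildOptions.cmake), and it assumes that a `FindCUDNN` module setting `CUDNN_FOUND` is available on the module path.

```cmake
# Hypothetical sketch only; names and placement are assumptions, not ROOT's code.
# Declare tmva-cudnn as an official option that is off unless requested.
option(tmva-cudnn "Enable cuDNN support in TMVA (requires tmva-gpu)" OFF)

if(tmva-cudnn)
  # cuDNN is only meaningful for the GPU build of TMVA.
  if(NOT tmva-gpu)
    message(FATAL_ERROR "tmva-cudnn=ON requires tmva-gpu=ON")
  endif()

  # Assumes a FindCUDNN module that sets CUDNN_FOUND.
  find_package(CUDNN)
  if(NOT CUDNN_FOUND)
    # Stop at configure time instead of failing later at link time.
    message(FATAL_ERROR "tmva-cudnn=ON was requested, but cuDNN was not found")
  endif()
endif()
```

With a check along these lines, the reproducer above would either find cudnn during configuration or stop with a clear error, rather than failing when libTMVA.so is linked.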
Yes this proposal makes sense to me, as you noticed the inconsistency tricked me too :laughing: