
Fails to build with cuDNN version 9

Open lahwaacz opened this issue 9 months ago • 4 comments

Check duplicate issues.

  • [X] Checked for duplicates

Description

Building with cuDNN 9.0 or later results in the following errors:

/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn/RecurrentPropagation.cu(500): error: identifier "cudnnRNNForwardTraining" is undefined
        cudnnStatus_t status = cudnnRNNForwardTraining(
                               ^
          detected during instantiation of "void TMVA::DNN::TCudnn<AFloat>::RNNForward(const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::RNNDescriptors_t &, TMVA::DNN::TCudnn<AFloat>::RNNWorkspace_t &, bool) [with AFloat=Float_t]" at line 43 of /build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu

/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn/RecurrentPropagation.cu(513): error: identifier "cudnnRNNForwardInference" is undefined
        cudnnStatus_t status = cudnnRNNForwardInference(
                               ^
          detected during instantiation of "void TMVA::DNN::TCudnn<AFloat>::RNNForward(const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::RNNDescriptors_t &, TMVA::DNN::TCudnn<AFloat>::RNNWorkspace_t &, bool) [with AFloat=Float_t]" at line 43 of /build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu

/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn/RecurrentPropagation.cu(545): error: identifier "cudnnRNNBackwardData" is undefined
     cudnnStatus_t status = cudnnRNNBackwardData(
                            ^
          detected during instantiation of "void TMVA::DNN::TCudnn<AFloat>::RNNBackward(const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::RNNDescriptors_t &, TMVA::DNN::TCudnn<AFloat>::RNNWorkspace_t &) [with AFloat=Float_t]" at line 43 of /build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu

/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn/RecurrentPropagation.cu(571): error: identifier "cudnnRNNBackwardWeights" is undefined
     status = cudnnRNNBackwardWeights(cudnnHandle, rnnDesc, seqLength, desc.xDesc.data(), x.GetDataPointer(),
              ^
          detected during instantiation of "void TMVA::DNN::TCudnn<AFloat>::RNNBackward(const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::RNNDescriptors_t &, TMVA::DNN::TCudnn<AFloat>::RNNWorkspace_t &) [with AFloat=Float_t]" at line 43 of /build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu

/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn/RecurrentPropagation.cu(500): error: identifier "cudnnRNNForwardTraining" is undefined
        cudnnStatus_t status = cudnnRNNForwardTraining(
                               ^
          detected during instantiation of "void TMVA::DNN::TCudnn<AFloat>::RNNForward(const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::RNNDescriptors_t &, TMVA::DNN::TCudnn<AFloat>::RNNWorkspace_t &, bool) [with AFloat=Double_t]" at line 44 of /build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu

/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn/RecurrentPropagation.cu(513): error: identifier "cudnnRNNForwardInference" is undefined
        cudnnStatus_t status = cudnnRNNForwardInference(
                               ^
          detected during instantiation of "void TMVA::DNN::TCudnn<AFloat>::RNNForward(const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::RNNDescriptors_t &, TMVA::DNN::TCudnn<AFloat>::RNNWorkspace_t &, bool) [with AFloat=Double_t]" at line 44 of /build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu

/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn/RecurrentPropagation.cu(545): error: identifier "cudnnRNNBackwardData" is undefined
     cudnnStatus_t status = cudnnRNNBackwardData(
                            ^
          detected during instantiation of "void TMVA::DNN::TCudnn<AFloat>::RNNBackward(const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::RNNDescriptors_t &, TMVA::DNN::TCudnn<AFloat>::RNNWorkspace_t &) [with AFloat=Double_t]" at line 44 of /build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu

/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn/RecurrentPropagation.cu(571): error: identifier "cudnnRNNBackwardWeights" is undefined
     status = cudnnRNNBackwardWeights(cudnnHandle, rnnDesc, seqLength, desc.xDesc.data(), x.GetDataPointer(),
              ^
          detected during instantiation of "void TMVA::DNN::TCudnn<AFloat>::RNNBackward(const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, TMVA::DNN::TCudnn<AFloat>::Tensor_t &, const TMVA::DNN::TCudnn<AFloat>::RNNDescriptors_t &, TMVA::DNN::TCudnn<AFloat>::RNNWorkspace_t &) [with AFloat=Double_t]" at line 44 of /build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu

8 errors detected in the compilation of "/build/root/src/root-6.30.06/tmva/tmva/src/DNN/Architectures/Cudnn.cu".

The missing functions were deprecated in cuDNN 8.0 and removed in cuDNN 9.0.
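
A minimal sketch of how such a removal can be guarded at compile time, assuming cuDNN 8 or newer headers (which ship `cudnn_version.h` and define `CUDNN_MAJOR`). The macro `SKETCH_HAS_LEGACY_RNN_API` and the helper `RnnForwardEntryPoint` are hypothetical names used only for illustration. They are not part of ROOT or cuDNN, and this is not necessarily how ROOT handles the migration.

```cpp
#include <string>

// Pull in cuDNN's version macros when the headers are installed.
// Assumption: cuDNN 8 or newer, where cudnn_version.h defines CUDNN_MAJOR.
#if defined(__has_include)
#  if __has_include(<cudnn_version.h>)
#    include <cudnn_version.h>
#  endif
#endif

// Hypothetical macro for this sketch (not part of ROOT or cuDNN):
// 1 only when the headers report a major version below 9, i.e. when the
// legacy RNN calls are still declared. Falls back to 0 if no headers are found.
#if defined(CUDNN_MAJOR) && CUDNN_MAJOR < 9
#  define SKETCH_HAS_LEGACY_RNN_API 1
#else
#  define SKETCH_HAS_LEGACY_RNN_API 0
#endif

// Hypothetical helper: names the forward-pass entry point that code
// targeting a given cuDNN major version would call.
inline std::string RnnForwardEntryPoint(int cudnnMajor)
{
   // cudnnRNNForward was introduced in cuDNN 8 and is the remaining forward
   // call in cuDNN 9; cudnnRNNForwardTraining no longer exists in 9.
   return cudnnMajor >= 9 ? "cudnnRNNForward" : "cudnnRNNForwardTraining";
}
```

Code that must support both major versions could wrap the legacy calls in `#if SKETCH_HAS_LEGACY_RNN_API ... #else ... #endif`. Since `cudnnRNNForward` already exists in cuDNN 8, moving to it unconditionally would avoid the branch altogether.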

Reproducer

Build from source with cuDNN 9.0 or newer.

ROOT version

6.30.06

Installation method

build from source

Operating system

Arch Linux

Additional context

No response

lahwaacz avatar May 04 '24 21:05 lahwaacz

Hi @dpiparo @lmoneta

Can this still be considered for 6.32? It would be nice for the LCG stacks if we could move to the latest cuDNN with CUDA 12.4.

andresailer avatar May 16 '24 07:05 andresailer

Hi all! To assess the situation, I tried to build ROOT with cuDNN 9.0 myself, and it is actually a huge interface change!

I wouldn't recommend that anyone attempt this migration without the help of CI tests, which we don't have for anything CUDA-related.

Just for reference, the previous migration to cuDNN 8.0 wasn't done by a core ROOT developer; it was generously contributed by the Arch package maintainer @kgizdov in 2020: https://github.com/root-project/root/pull/6058

Of the 3350 lines of code in tmva/tmva/src/DNN/Architectures/Cudnn, a significant fraction had to be changed.

Therefore, we need to have a discussion: should cudnn even be enabled in any build of ROOT?

I have a few more data points, besides the observation that it's only packagers that seem to care about cudnn=ON:

  • All questions about "cudnn" on the forum are about build problems, not actual usage: https://root-forum.cern.ch/search?q=cudnn
  • On indico, it also doesn't seem like it's used much: https://indico.cern.ch/search/?q=cudnn&sort=mostrecent
  • There is only one presentation about this work (a summer student talk)

For 3350 lines of code in ROOT where we don't know if they are used, the support burden is very high.

IMHO, you, @andresailer and @lahwaacz should consider going for cudnn=OFF, and we should only continue to invest in this ROOT component once an actual user complains about its absence either here on GitHub or on the forum.

@lmoneta and @dpiparo, what is your opinion?

guitargeek avatar May 22 '24 00:05 guitargeek

Hi @guitargeek ,

These proceedings also discuss cuDNN and TMVA: https://www.epj-conferences.org/articles/epjconf/pdf/2020/21/epjconf_chep2020_06019.pdf

andresailer avatar May 22 '24 08:05 andresailer

@guitargeek : I will soon open a PR adding this migration.

lmoneta avatar May 22 '24 08:05 lmoneta

Done in master; the PR for 6.32 has been submitted and tests are running: https://github.com/root-project/root/pull/15636

dpiparo avatar May 24 '24 14:05 dpiparo

@andresailer

dpiparo avatar May 24 '24 14:05 dpiparo