onnxruntime [Build] CUDNN on Windows build fails due to wrong directory name assumption

Describe the issue

cmake/CMakeLists.txt near line 1015 wrongly assumes that the cuDNN libraries on Windows will be in a /lib/x64 style path. This leads to a build failure because cudnn.lib cannot be found.

This assumption contradicts the official NVIDIA cuDNN installation instructions https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installwindows and does not match the directory layout of distributions such as the cuDNN 8.4.1.50 Windows ZIP download.

Easy fix in forthcoming PR.

Urgency

No response

Target platform

Windows

Build script

.\build.bat --update --build --skip_tests --cmake_generator "Visual Studio 16 2019" --config RelWithDebInfo --build_shared_lib --parallel --use_dml --use_cuda --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4" --cuda_version 11.4 --cudnn_home "C:\repos-nobackup\cudnn-windows-x86_64-8.4.1.50_cuda11.6-archive" --use_tensorrt --tensorrt_home "C:\repos-nobackup\TensorRT-8.4.1.5"

Error / output

absl_cord.vcxproj -> C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\external\abseil-cpp\absl\strings\RelWithDebInfo\absl_cord.lib
  nvonnxparser_static.vcxproj -> C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\external\onnx-tensorrt\RelWithDebInfo\nvonnxparser_static.lib
  onnx.vcxproj -> C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\external\onnx\RelWithDebInfo\onnx.lib
LINK : fatal error LNK1104: cannot open file 'cudnn.lib' [C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_providers_tensorrt.vcxproj]
  test_execution_provider.vcxproj -> C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\test_execution_provider.dll
LINK : fatal error LNK1104: cannot open file 'cudnn.lib' [C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
Traceback (most recent call last):
  File "C:\repos-nobackup\onnxruntime\tools\ci_build\build.py", line 2744, in <module>
    sys.exit(main())
  File "C:\repos-nobackup\onnxruntime\tools\ci_build\build.py", line 2663, in main
    build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
  File "C:\repos-nobackup\onnxruntime\tools\ci_build\build.py", line 1301, in build_targets
    run_subprocess(cmd_args, env=env)
  File "C:\repos-nobackup\onnxruntime\tools\ci_build\build.py", line 714, in run_subprocess
    return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
  File "C:\repos-nobackup\onnxruntime\tools\python\util\run.py", line 49, in run
    completed_process = subprocess.run(
  File "C:\python39\lib\subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Program Files\\CMake\\bin\\cmake.EXE', '--build', 'C:\\repos-nobackup\\onnxruntime\\\\build\\Windows\\RelWithDebInfo', '--config', 'RelWithDebInfo', '--', '/maxcpucount:16', '/nodeReuse:False']' returned non-zero exit status 1.

Visual Studio Version

16.11.18

GCC / Compiler Version

No response

Sep 15 '22 02:09 diablodale

The underlying issue is that the installation instructions and folder setup for cuDNN keep changing. The ORT CMake file needs to handle this somewhat insane number of combinations, because any of them could be 'valid' depending on when the user followed the cuDNN install instructions.

In August the instructions changed so that the source folder is 'lib'.

Previously, the instructions were to copy the 'lib/x64' folder:

https://web.archive.org/web/20220724181728/https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html

And earlier in the year, they involved copying to the CUDA install directory rather than a cuDNN directory:

https://web.archive.org/web/20220129092717/https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html

Sep 15 '22 08:09 skottmckay

Got it. I saw the comment "#TODO: combine onnxruntime_CUDNN_HOME and onnxruntime_CUDA_HOME, assume they are the same" a few lines above my fix. I discourage that (and recommend removing that TODO), as Windows package managers and the two most recent methods you list above don't put CUDA and cuDNN in shared folders.

From what I can see at the moment, Windows is the pain point and the libraries are in one of two directories. My PR adds both of those possibilities. Dropping the TODO keeps the two variables separate, which handles both the shared and the non-shared layouts.

My alternative was to probe with IF(EXISTS) and a series of if/else to use the path where it finds cudnn.lib. Another option is file(GLOB_RECURSE ... cudnn.lib), using the path it returns. Or go fancy and burn devtime using/writing a CMake find module for cuDNN. ;-)
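
For illustration, a minimal sketch of the IF(EXISTS) probing idea. It assumes onnxruntime_CUDNN_HOME (the variable named in the TODO) holds the cuDNN root, and that cudnn.lib sits in either lib or lib/x64 beneath it. The names cudnn_root, CUDNN_LIB_CANDIDATES and CUDNN_LIB_DIR are made up for this example, and this is not the code from the PR:

```cmake
# Hypothetical sketch: probe the two known Windows layouts for cudnn.lib.
# Assumption: onnxruntime_CUDNN_HOME points at the cuDNN root directory.

# Normalize backslashes so the path composes cleanly with forward slashes.
file(TO_CMAKE_PATH "${onnxruntime_CUDNN_HOME}" cudnn_root)

set(CUDNN_LIB_CANDIDATES
    "${cudnn_root}/lib"       # layout without the x64 subfolder
    "${cudnn_root}/lib/x64")  # layout from the older instructions

# Take the first candidate directory that actually contains cudnn.lib.
set(CUDNN_LIB_DIR "")
foreach(candidate IN LISTS CUDNN_LIB_CANDIDATES)
  if(EXISTS "${candidate}/cudnn.lib")
    set(CUDNN_LIB_DIR "${candidate}")
    break()
  endif()
endforeach()

if(NOT CUDNN_LIB_DIR)
  # Fail at configure time instead of at link time with LNK1104.
  message(FATAL_ERROR
          "cudnn.lib not found under ${cudnn_root} (tried: ${CUDNN_LIB_CANDIDATES})")
else()
  message(STATUS "Using cuDNN libraries from ${CUDNN_LIB_DIR}")
  link_directories("${CUDNN_LIB_DIR}")
endif()
```

CMake's built-in find_library with PATH_SUFFIXES lib lib/x64 would be a more compact way to express the same probe; the explicit loop just makes the order of the candidates obvious.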

Sep 15 '22 08:09 diablodale