[Build] CUDNN on Windows build fails due to wrong directory name assumption
Describe the issue
cmake/CMakeLists.txt near line 1015 wrongly assumes that the cuDNN libraries on Windows are in a /lib/x64 style path. The build then fails because cudnn.lib cannot be found.
This assumption contradicts the official NVIDIA cuDNN installation instructions (https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installwindows) and does not match the directory layout of distributions such as the cuDNN 8.4.1.50 Windows ZIP download.
Easy fix in forthcoming PR.
Urgency
No response
Target platform
Windows
Build script
.\build.bat --update --build --skip_tests --cmake_generator "Visual Studio 16 2019" --config RelWithDebInfo --build_shared_lib --parallel --use_dml --use_cuda --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4" --cuda_version 11.4 --cudnn_home "C:\repos-nobackup\cudnn-windows-x86_64-8.4.1.50_cuda11.6-archive" --use_tensorrt --tensorrt_home "C:\repos-nobackup\TensorRT-8.4.1.5"
Error / output
absl_cord.vcxproj -> C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\external\abseil-cpp\absl\strings\RelWithDebInfo\absl_cord.lib
nvonnxparser_static.vcxproj -> C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\external\onnx-tensorrt\RelWithDebInfo\nvonnxparser_static.lib
onnx.vcxproj -> C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\external\onnx\RelWithDebInfo\onnx.lib
LINK : fatal error LNK1104: cannot open file 'cudnn.lib' [C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_providers_tensorrt.vcxproj]
test_execution_provider.vcxproj -> C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\test_execution_provider.dll
LINK : fatal error LNK1104: cannot open file 'cudnn.lib' [C:\repos-nobackup\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
Traceback (most recent call last):
  File "C:\repos-nobackup\onnxruntime\tools\ci_build\build.py", line 2744, in <module>
    sys.exit(main())
  File "C:\repos-nobackup\onnxruntime\tools\ci_build\build.py", line 2663, in main
    build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
  File "C:\repos-nobackup\onnxruntime\tools\ci_build\build.py", line 1301, in build_targets
    run_subprocess(cmd_args, env=env)
  File "C:\repos-nobackup\onnxruntime\tools\ci_build\build.py", line 714, in run_subprocess
    return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
  File "C:\repos-nobackup\onnxruntime\tools\python\util\run.py", line 49, in run
    completed_process = subprocess.run(
  File "C:\python39\lib\subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Program Files\\CMake\\bin\\cmake.EXE', '--build', 'C:\\repos-nobackup\\onnxruntime\\\\build\\Windows\\RelWithDebInfo', '--config', 'RelWithDebInfo', '--', '/maxcpucount:16', '/nodeReuse:False']' returned non-zero exit status 1.
Visual Studio Version
16.11.18
GCC / Compiler Version
No response
The underlying issue is that the installation instructions and folder layout for cuDNN keep changing. The ORT cmake file needs to handle this somewhat insane number of combinations, because depending on when the user followed the cuDNN install instructions, any of them could be 'valid'.
In August the instructions changed so that the source folder is 'lib'.
Previously the instructions were to copy the 'lib/x64' folder:
https://web.archive.org/web/20220724181728/https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html
And earlier in the year they involved copying the files into the CUDA install directory rather than a separate cuDNN directory:
https://web.archive.org/web/20220129092717/https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html
Got it. I saw the comment "#TODO: combine onnxruntime_CUDNN_HOME and onnxruntime_CUDA_HOME, assume they are the same" a few lines above my fix. I would discourage that (and recommend removing the TODO), because Windows package managers and the two most recent methods you list above don't put CUDA and cuDNN in a shared folder.
From what I can see at the moment, Windows is the pain point, and the libraries are in one of two directories. My PR adds both of those possibilities. Dropping the TODO then sidesteps the question of whether CUDA and cuDNN share a folder.
My alternative was to probe with IF(EXISTS) and a series of if/else branches, and use whichever path contains cudnn.lib. We could also use file(GLOB_RECURSE ... cudnn.lib) and take the path it returns. Or go fancy and burn dev time using or writing a CMake find module for cuDNN. ;-)
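For reference, the IF(EXISTS) probing alternative could look roughly like the sketch below. It is only an illustration, not what the PR does: the two candidate directories are the ones discussed in this thread, the variable names _cudnn_candidate and _cudnn_lib_dir are made up for the example, and how the real CMakeLists.txt consumes the resulting directory is an assumption.

```cmake
# Hypothetical sketch: pick whichever known cuDNN layout actually contains cudnn.lib.
# onnxruntime_CUDNN_HOME is the existing cache variable; everything prefixed with
# _cudnn_ is an illustrative name, not something that exists in the ORT build.
if(WIN32)
  set(_cudnn_lib_dir "")
  foreach(_cudnn_candidate
          "${onnxruntime_CUDNN_HOME}/lib"        # layout from the newer instructions
          "${onnxruntime_CUDNN_HOME}/lib/x64")   # layout from the older instructions
    if(EXISTS "${_cudnn_candidate}/cudnn.lib")
      set(_cudnn_lib_dir "${_cudnn_candidate}")
      break()
    endif()
  endforeach()

  if(NOT _cudnn_lib_dir)
    message(FATAL_ERROR
      "cudnn.lib not found in lib or lib/x64 under ${onnxruntime_CUDNN_HOME}")
  endif()

  message(STATUS "Using cuDNN libraries from ${_cudnn_lib_dir}")
  # Assumption: the directory is consumed via link_directories; adjust to
  # whatever mechanism the surrounding CMake code really uses.
  link_directories("${_cudnn_lib_dir}")
endif()
```

Compared with simply adding both directories to the link path, this fails at configure time with a clear message instead of at link time with LNK1104.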