rules_cuda

[BUG] Linking fails with a custom cc_toolchain and platform_cpu: the multi-platform CUDA library is not detected

Open ZhenshengLee opened this issue 1 year ago • 3 comments

brief

NOTE: On the default platform, i.e. the x86_64 (k8) toolchain, both compiling and linking work. Is this a bug, or a misconfiguration in how I use this repo?

environment

bazel: version 7.0.2
cc toolchain: //bazel/toolchains/v5l (a custom cc toolchain for cross-compiling to aarch64, similar to https://github.com/f0rmiga/gcc-toolchain/blob/main/toolchain/cc_toolchain_config.bzl)

├── toolchains
│   └── v5l
│       ├── BUILD
│       ├── v5l.BUILD
│       └── v5l_cc_toolchain_config.bzl

repro steps

Build the basic example with cu_library; the build reports the errors below. NOTE: On the default platform, i.e. the x86_64 (k8) toolchain, the build works.

cc_binary(
    name = "module_cuda_main",
    srcs = ["tool/module_cuda_main.cpp"],
    includes = ["include"],
    tags = ["tool"],
    visibility = ["//main:__pkg__"],
    deps = [
        ":module_cuda"
    ],
)
(03:11:14) INFO: Current date is 2024-08-08
(03:11:14) INFO: Analyzed 323 targets (0 packages loaded, 21 targets configured).
(03:11:14) ERROR: /gw_demo/modules/team_demo/module_demo/BUILD:57:15: Linking modules/team_demo/module_demo/module_cuda_main failed: (Exit 1): aarch64-buildroot-linux-gnu-gcc failed: error executing CppLink command (from target //modules/team_demo/module_demo:module_cuda_main) 
  (cd /home/zs/.cache/bazel/_bazel_zs/2c098eac6c684e1fabebb74f5f4483bd/execroot/gaos && \
  exec env - \
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/opt/rti.com/rti_connext_dds-6.0.1/lib/x64Linux4gcc7.3.0:/opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib \
    PATH=/usr/local/cuda/bin:/opt/rti.com/rti_connext_dds-6.0.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/ros/humble/bin \
    PWD=/proc/self/cwd \
  external/v5l_cc_toolchain/bin/aarch64-buildroot-linux-gnu-gcc -o bazel-out/aarch64-dbg/bin/modules/team_demo/module_demo/module_cuda_main -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64' -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccudart___Ulib' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccudart___Ulib' -Xlinker -rpath -Xlinker '$ORIGIN/../../../_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccuda___Ulib_Sstubs' -Xlinker -rpath -Xlinker '$ORIGIN/module_cuda_main.runfiles/gaos/_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccuda___Ulib_Sstubs' -Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64 -Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64 -Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64 
-Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccudart___Ulib -Lbazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@cuda_Uaarch64_Ulinux_S_S_Ccuda___Ulib_Sstubs bazel-out/aarch64-dbg/bin/modules/team_demo/module_demo/_objs/module_cuda_main/module_cuda_main.pic.o bazel-out/aarch64-dbg/bin/modules/team_demo/module_demo/libmodule_cuda.a external/local_cuda/cuda/lib64/libcudadevrt.a -lcudart -l:libcudart.so.11.0 -l:libcudart.so.11.4.409 -lcudart -lcuda -pie -ldl -lpthread -lrt -Wl,-rpath,lib/ -L/drive/drive-linux/lib-target/ -L/drive/drive-linux/filesystem/targetfs/usr/lib/aarch64-linux-gnu/ -Wl,-rpath-link,/drive/drive-linux/filesystem/targetfs/usr/lib/aarch64-linux-gnu/ -Wl,-rpath-link,/drive/drive-linux/lib-target/ -Wl,-rpath-link,/usr/lib/aarch64-linux-gnu -Wl,-rpath-link,/usr/aarch64-linux-gnu -Wl,-rpath-link,/drive/drive-linux/filesystem/targetfs/lib/aarch64-linux-gnu -lgcov -lstdc++ -no-canonical-prefixes)
# Configuration: 93bfd7653555f545157f5fbb9812135069a379b953233ab0eef19c8f88c3340d
# Execution platform: @@local_config_platform//:host
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64/libcudart.so when searching for -lcudart
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64/libcudart.so.11.0 when searching for -l:libcudart.so.11.0
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64/libcudart.so.11.0 when searching for -l:libcudart.so.11.0
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: cannot find -l:libcudart.so.11.0
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64/libcudart.so.11.4.409 when searching for -l:libcudart.so.11.4.409
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.4.409___Ucuda_Slib64/libcudart.so.11.4.409 when searching for -l:libcudart.so.11.4.409
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: cannot find -l:libcudart.so.11.4.409
/drive/toolchains/aarch64--glibc--stable-2022.03-1/bin/../lib/gcc/aarch64-buildroot-linux-gnu/9.3.0/../../../../aarch64-buildroot-linux-gnu/bin/ld: skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so___Ucuda_Slib64/libcudart.so when searching for -lcudart
collect2: error: ld returned 1 exit status
(03:11:14) INFO: Elapsed time: 0.602s, Critical Path: 0.08s
(03:11:14) INFO: 2 processes: 2 internal.
(03:11:14) ERROR: Build did NOT complete successfully

considerations

skipping incompatible bazel-out/aarch64-dbg/bin/_solib_aarch64-buildroot-linux-gnu/_U@@local_Ucuda_S_S_Clibcudart.so.11.0___Ucuda_Slib64/libcudart.so.11.0 when searching for -l:libcudart.so.11.0

This means the linker is still being given the x86_64 CUDA libraries from /usr/local/cuda/lib64.

In practice, the CUDA libraries may be installed in other directories, and a single install may contain builds for multiple architectures, especially on NVIDIA AGX machines:

  • CUDA: Should be installed at /usr/local, CUDA for various platforms should be in the target directory of /usr/local/cuda-X
  • e.g. aarch64-linux CUDA 10.1 should be located at /usr/local/cuda-10.1/targets/aarch64-linux
  • CUDA-X DL Libs (i.e. TensorRT and cuDNN): Should be located at /usr/local/cuda-X/dl/targets/<PLATFORM>/{include, lib}
  • Other system dependencies: Dependencies should be located in /usr/local/{include, lib} for x86_64, /usr/aarch64-linux-gnu/ for aarch64-linux and /usr/aarch64-unknown-nto-qnx/aarch64le for aarch64-qnx

https://github.com/NVIDIA/DL4AGX/blob/9a4f60c2847d32e81372b9a2165299a3b65eabf1/CONTRIBUTING.md?plain=1#L201-L205

related info

There is an older CUDA toolchain configuration that supports multi-platform (per-CPU) compilation in Bazel, but it relies on CROSSTOOL, which is outdated and not available in the latest versions of Bazel.

https://github.com/NVIDIA/DL4AGX/tree/master

EDIT: There is already an issue about resolving multiple versions of the CUDA libraries, but I don't think it was resolved by design: https://github.com/bazel-contrib/rules_cuda/issues/113

workaround (works)

Adding the library path manually lets the binary link successfully.

linkopts = [
        "-L/usr/local/cuda/targets/aarch64-linux/lib",
    ],
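
A variant of this workaround that leaves x86_64 builds untouched could wrap the flag in select(). This is only a sketch: it assumes the build is driven by --platforms with a target platform that carries the standard @platforms//cpu:aarch64 constraint (a plain --cpu flag may not match it), and it reuses the target and path from this issue.

```starlark
# BUILD (sketch): add the aarch64 CUDA library directory only when the
# target platform is aarch64; other platforms get no extra linkopts.
cc_binary(
    name = "module_cuda_main",
    srcs = ["tool/module_cuda_main.cpp"],
    includes = ["include"],
    linkopts = select({
        # Path taken from the workaround above; adjust to the local install.
        "@platforms//cpu:aarch64": [
            "-L/usr/local/cuda/targets/aarch64-linux/lib",
        ],
        "//conditions:default": [],
    }),
    deps = [":module_cuda"],
)
```

The absolute path still makes the build non-hermetic, so this remains a workaround rather than a fix.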

ZhenshengLee, Aug 08 '24 03:08

I found this in the docs page:

rules_cuda_dependencies(toolkit_path)

Populate the dependencies for rules_cuda. This will setup workspace dependencies (other bazel rules) and local toolchains.

Name | Description | Default Value
toolkit_path | Optionally specify the path to CUDA toolkit. If not specified, it will be detected automatically. |

Is there an example to show how to use it correctly?
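
For reference, a call with an explicit toolkit_path might look like the sketch below. The load path is an assumption based on the rules_cuda README and may differ between versions, and the path value is purely illustrative. Pointing toolkit_path at a toolkit root may not by itself select the aarch64 libraries under targets/.

```starlark
# WORKSPACE (sketch). The load path is an assumption taken from the
# rules_cuda README; verify it against the pinned rules_cuda version.
load("@rules_cuda//cuda:repositories.bzl", "rules_cuda_dependencies")

# toolkit_path is the parameter named in the docs excerpt.
# "/usr/local/cuda" is an illustrative value, not a recommendation.
# Toolchain registration, if the pinned version needs a separate call
# for it, is intentionally omitted from this sketch.
rules_cuda_dependencies(toolkit_path = "/usr/local/cuda")
```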

ZhenshengLee, Aug 08 '24 12:08

I think you are configuring it correctly. The root problem is that cross-compiling is not addressed by these rules at the moment. exec_compatible_with for the tools and target_compatible_with for the runtime are assumed to be the same, but this is not enforced, so it can be worked around.
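
To make the exec/target distinction concrete, here is a minimal sketch in plain Bazel terms, not a rules_cuda feature. The package path //bazel/platforms and the platform name are hypothetical; platform(), constraint_values, and target_compatible_with are standard Bazel.

```starlark
# bazel/platforms/BUILD (hypothetical package): describes the machine the
# binary will run on, as opposed to the x86_64 host that runs the tools.
platform(
    name = "aarch64_linux",
    constraint_values = [
        "@platforms//os:linux",
        "@platforms//cpu:aarch64",
    ],
)

# Elsewhere, a target can declare that it only makes sense for that
# platform. Bazel skips it (or errors if requested explicitly) when the
# target platform does not satisfy these constraints.
cc_binary(
    name = "module_cuda_main",
    srcs = ["tool/module_cuda_main.cpp"],
    target_compatible_with = [
        "@platforms//os:linux",
        "@platforms//cpu:aarch64",
    ],
    deps = [":module_cuda"],
)

# Invocation (sketch):
#   bazel build --platforms=//bazel/platforms:aarch64_linux //...
```

This only expresses the constraint on the user's side; it does not make the rules pick aarch64 CUDA libraries, which is the missing piece described above.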

cloudhan, Aug 08 '24 13:08

> I think you are configuring it correctly. The root problem is the cross compiling is not addressed in this rule at the moment.

OK, I will keep the issue open.

ZhenshengLee, Aug 09 '24 01:08