rules_cuda icon indicating copy to clipboard operation
rules_cuda copied to clipboard

Using custom thrust repo

Open garymm opened this issue 2 years ago • 14 comments
trafficstars

I'm trying to use the latest Thrust release and running into issues. I can reproduce this in the examples of this repo. See here: https://github.com/garymm/rules_cuda/tree/0b47488/examples

It seems that for some reason the CUDA toolkit's thrust is used, even though the -isystem for my custom thrust comes first on the nvcc command.

To reproduce:

# First, install any CUDA toolkit at or less than version 12.1
# Then
git clone https://github.com/garymm/rules_cuda.git
cd rules_cuda
git switch garymm/custom-thrust
cd examples
bazel build -s //thrust:version_test

produces:

➜  bazel build -s //thrust:version_test                
SUBCOMMAND: # //thrust:version_test_cu [action 'Compiling thrust/version_test.cu', configuration: c92f7800966ee88d4403d82e9237650892518949abe87c60f2a9a688842ef3b8, execution platform: @local_config_platform//:host]
(cd /home/garymm/.cache/bazel/_bazel_garymm/40b8e895884343ff3f99cfc29b0e57bd/execroot/rules_cuda_examples && \
  exec env - \
    PATH=/usr/bin \
  /usr/local/cuda/bin/nvcc -x cu -Xcompiler -fPIC -ccbin /usr/bin/gcc -I . -I bazel-out/k8-fastbuild/bin -I external/thrust -I bazel-out/k8-fastbuild/bin/external/thrust -I external/cub -I bazel-out/k8-fastbuild/bin/external/cub -I external/local_cuda -I bazel-out/k8-fastbuild/bin/external/local_cuda -isystem external/thrust -isystem bazel-out/k8-fastbuild/bin/external/thrust -isystem external/cub -isystem bazel-out/k8-fastbuild/bin/external/cub -isystem external/local_cuda/cuda/include -isystem bazel-out/k8-fastbuild/bin/external/local_cuda/cuda/include -Xcompiler -g1 -c thrust/version_test.cu -o bazel-out/k8-fastbuild/bin/thrust/_objs/version_test_cu/version_test.pic.o --expt-relaxed-constexpr --expt-extended-lambda)
# Configuration: c92f7800966ee88d4403d82e9237650892518949abe87c60f2a9a688842ef3b8
# Execution platform: @local_config_platform//:host
ERROR: /home/garymm/src/bazel-contrib/rules_cuda/examples/thrust/BUILD.bazel:17:13: Compiling thrust/version_test.cu failed: (Exit 2): nvcc failed: error executing command (from target //thrust:version_test_cu) /usr/local/cuda/bin/nvcc -x cu -Xcompiler -fPIC -ccbin /usr/bin/gcc -I . -I bazel-out/k8-fastbuild/bin -I external/thrust -I bazel-out/k8-fastbuild/bin/external/thrust -I external/cub -I ... (remaining 25 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
thrust/version_test.cu(4): error: static assertion failed
  static_assert((200001 / 100 % 1000) == 1);```

This is on Ubuntu, with CUDA 12.1 installed here:

➜  ls /usr/local/cuda*
/usr/local/cuda@  /usr/local/cuda-12@

/usr/local/cuda-12.1

garymm avatar Jun 02 '23 00:06 garymm

See https://github.com/bazel-contrib/rules_cuda/pull/38#issuecomment-1303892768.

Won't fix if you want to upgrade the bundled version to newer version. In that case, you need to roll out your own thrust target with repository rule.

cloudhan avatar Jun 02 '23 04:06 cloudhan

You can refer to https://github.com/bazel-contrib/rules_cuda/blob/cf17ea98fc73314d18cd051a848e7c9df0a97325/third_party/thrust.BUILD as an example, without forking this repo.

cloudhan avatar Jun 02 '23 04:06 cloudhan

The CHAR_MIN issue seems to come from missing climits header. https://en.cppreference.com/w/cpp/types/climits

cloudhan avatar Jun 02 '23 04:06 cloudhan

The root cause is https://github.com/garymm/rules_cuda/blob/d66063672a7e3ea397f40a841b2411e0e4d1ef11/examples/WORKSPACE.bazel#LL33C11-L33C11 in your code. And this file screwed everything https://github.com/NVIDIA/thrust/blob/main/thrust/limits.h.

You owe me a cup of coffee for debugging this. 😈

cloudhan avatar Jun 02 '23 05:06 cloudhan

Removing includes does indeed fix it. I'm a little confused as to why users of the library are able to use system-include syntax without it though. I.e. #include <thrust/device_vector.h>. I thought the angle-bracket includes required -isystem and -isystem is only set if the cc_library has includes set?

garymm avatar Jun 02 '23 17:06 garymm

Because there is <cuda_tookit_root>/include/thrust. The system include path of <cuda_tookit_root>/include is transitively added due to deps chain @thrust -> @cub -> @rules_cuda//:cuda -> @rules_cuda//:cuda_headers

cloudhan avatar Jun 02 '23 17:06 cloudhan

If I change it to includes = [""], which I think is correct, the toolchain's local_cuda includes clobber my custom thrust's includes.

You can see the behavior here: https://github.com/garymm/rules_cuda/tree/3b973fa/examples

So I think @rules_cuda//:cuda_headers should not include the thrust and cub headers. Would you be open to a PR removing them from the cuda_headers target?

garymm avatar Jun 02 '23 17:06 garymm

Reasonable. Would you like to contribute a PR for this? All you need to change is exclude thrust and cub from the glob and add two seperate libraries for cub and thrust.

https://github.com/bazel-contrib/rules_cuda/blob/00a1c59013464bcb3f643a4751b3859428c29659/cuda/runtime/BUILD.local_cuda#L20-L30

cloudhan avatar Jun 02 '23 17:06 cloudhan

I just tried that and it doesn't work for some reason... will investigate.

garymm avatar Jun 02 '23 17:06 garymm

OK, this will be tricky because the cuda toolkit include will be symlinked by bazel into the sandbox. Then all sub directory will be accessible...

cloudhan avatar Jun 02 '23 18:06 cloudhan

I updated the issue description with the latest status. I'm really not sure what's going wrong. I tried using the -M and -H options for the compiler, and I also tried passing --generate-dependencies-with-compile to nvcc, and in all cases it seems to suggest that my custom thrust's header should be the on that's included, but the actually compiled code uses the CUDA toolkit bundled version.

When I run the exact command printed out by bazel build -s, it seems to work. So maybe there is some environment variable or something that is set that is causing the include directory of the toolkit to be searched in a different order than what is suggested on the command line?

garymm avatar Jun 02 '23 21:06 garymm

Hmm, indeed there's something about the sandbox that changes the behavior. Building with --sandbox_strategy=local works.

garymm avatar Jun 02 '23 21:06 garymm

And for some reason removing includes = ["."] from the thrust and cub rules works, even though the resulting change in the nvcc command seems like the opposite of what I would expect to want.

garymm avatar Jun 02 '23 21:06 garymm

I can contribute an example that uses custom thrust and cub with no includes = line and then close this, but I'm a bit at a loss as to why the includes breaks things. If you have some idea please LMK. Or LMK if you want me to just contribute the example and then close this.

garymm avatar Jun 05 '23 23:06 garymm