rules_cuda
Support clang as a CUDA compiler
Clang is capable of CUDA compilation these days.
It would be great to add support for using it for CUDA compilation with bazel.
Yeah, there is a plan. I have a local modification, but it hasn't made it in yet.
@Artem-B Considering you are one of the developers of clang's CUDA support, I have a question. I now have a nearly working configuration for clang. The only problem I am facing is, for example:
__global__ void kernel() {
  // blah blah with impl
}
with nvcc, the compiled object has the symbols
000000000000019a T __device_stub__Z6kernelv()
00000000000002aa T kernel()
but with clang version 15.0.0 (https://github.com/llvm/llvm-project.git 009d56da5c4ea3666c4753ce7564c8c20d7e0255), it only has
0000000000000000 T __device_stub__kernel()
The kernel() symbol is missing from the object, which causes linker errors for various __global__ functions in my nccl example. In that example, a libnccl.so is produced first and then linked into various perf test binaries. I am curious what the root problem is here.
nvcc and clang do have somewhat different behavior under the hood, and you're likely dealing with more than one issue here.
First, kernels and stubs. Clang indeed no longer emits a kernel() symbol on the host side: https://cuda.godbolt.org/z/833ejYs1c
The main reason for that was to make it possible to distinguish the host-side stub from the actual kernel function in a debugger on AMD GPUs. It does mean that CUDA objects compiled with clang will not be able to use <<<...>>> to launch kernels from objects compiled with nvcc. This is probably not a big deal, as mixing two different compilers in the same build is a bad idea to start with. I do not think that's the root cause of the problem you see.
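For illustration only, a hedged way to see this difference is to compile the kernel above with both compilers and compare the host-side symbols. The file name kernel.cu, sm_75, and a default CUDA installation are assumptions, not taken from the thread:
nvcc -c kernel.cu -o kernel_nvcc.o && nm -C kernel_nvcc.o | grep -i kernel
clang -c -x cuda kernel.cu --cuda-gpu-arch=sm_75 -o kernel_clang.o && nm -C kernel_clang.o | grep -i kernel
The nvcc object lists both __device_stub__... and kernel(); the clang object lists only the stub, which matches the nm output quoted above.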
NCCL does rely on RDC compilation (i.e. each source compiles to a GPU object file, instead of a fully linked GPU executable) and that part works very differently in clang vs nvcc. In a nutshell, object files need an extra final linking step, and a bit of extra host-side 'glue' code. NVCC does that under the hood. Clang does not, yet.
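For comparison, a rough sketch of the manual RDC flow that nvcc performs under the hood; file names are placeholders and library search paths are omitted. Clang currently needs an equivalent device-link step wired into the build:
nvcc -dc a.cu -o a.o
nvcc -dc b.cu -o b.o
nvcc -dlink a.o b.o -o dlink.o
g++ a.o b.o dlink.o -lcudadevrt -lcudart -o app
The -dlink step produces the extra object containing the fully linked GPU code plus the host-side registration glue mentioned above.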
Here's how tensorflow implements the cuda_rdc_library it uses to compile nccl: https://github.com/tensorflow/tensorflow/blob/ee7cf722e4ca9d02b6e62eb3d1b7506ead995422/third_party/nccl/build_defs.bzl.tpl#L261
@jhuber6 has been working on clang driver changes to make compile-to-object-and-link "just work" on the GPU side. E.g. https://github.com/llvm/llvm-project/commit/b7c8c4d8cf07d2e9e8cd157bccc8bd9e7c76415a It's an experimental feature at the moment, but it will make things much simpler, once it's ready.
Also, I believe CMake has recently added support for clang as the CUDA compiler. It may be worth checking whether/how they handle RDC compilation there.
I think the expected way to perform RDC-mode compilation is via the CUDA_SEPARABLE_COMPILATION option. I think this is supported for Clang as well judging by this issue.
Judging by the commit that implemented it in CMake, they use the same RDC compilation process that we've implemented in tensorflow, which I pointed to above.
You may as well just pick up tensorflow's implementation directly. @chsigg would probably be the most familiar with the details, if you have questions.
OK, problem solved. It turns out that I need to compile all the C code of nccl as CUDA with -x cu. Otherwise, there will be linker errors caused by __global__ functions. Also mentioned here.
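For reference, a hedged sketch of what that looks like at the command level; the file name is a placeholder, and note that nvcc spells the language cu while clang spells it cuda:
nvcc -x cu collectives.c -c -o collectives.o
clang -x cuda collectives.c -c --cuda-gpu-arch=sm_70 -o collectives.o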
@Artem-B I think this is addressed in c13ebaada58956435658b18bb91fb46a23534995
Use --@rules_cuda//cuda:compiler=clang to select the compiler. See the detect_clang macro for auto-detection; otherwise, you'll need to provide the toolchain config and registration yourself.
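A minimal .bazelrc sketch using only flags that appear elsewhere in this thread; the clang config name is arbitrary, and --repo_env=CC=clang just keeps the host C/C++ compiler consistent with the CUDA compiler:
build:clang --@rules_cuda//cuda:compiler=clang
build:clang --repo_env=CC=clang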
Since you are a member of the tensorflow team, I am wondering if this can be mentioned or evaluated in TF. It might be good for the bazel community ;)
Do you mean that your build rules are ready to use?
I guess the right sequence to make it all work is to get these rules upstreamed into bazel, and then port the TF build to them. I'll see what I can do to bring it to the attention of the right people.
Not necessarily production ready, but at least usable. It needs more users to test it out before I can say it is production ready. Because it is a build system, there are too many corner cases.
One thing that could serve as a motivation to adopt these changes would be to try getting Tensorflow to build using your rules, instead of the ones TF carries. It would be a pretty decent test of the real-world usability of the rules -- TF is probably the largest bazel user outside of Google and is almost certainly the largest user of clang for CUDA compilation. Having them convinced would go a long way towards convincing bazel owners that these rules should be part of bazel.
Having a proof of concept at that level would also give TF owners a rough idea of how much work it would take to adopt it and whether it's worth it. One thing to keep in mind is that TF also has to work with our internal build. I don't know yet how hard it would be to switch to your rules. If it's a drop-in replacement for the cuda_library() implementation, it should be doable. NCCL and other RDC compilation users would need some more work, but it should be manageable, too.
@Artem-B Do we have a prebuilt LLVM package with the NVPTX backend enabled? I'd like to add a build CI so that I can confidently close this issue.
LLVM/Clang releases should have NVPTX built in. E.g. https://github.com/llvm/llvm-project/releases/tag/llvmorg-14.0.6
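A quick way to confirm that a given clang/LLVM build ships the NVPTX backend (the same output shows up later in this thread):
llc -version | grep -i nvptx
which should list the nvptx and nvptx64 registered targets.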
On a side note, just an FYI that there have been a lot of offloading-related changes in the clang driver lately that are going to make GPU compilation much closer to C++ compilation. E.g. RDC compilation would "just work": clang -c a.cu -o a.o; clang -c b.cu -o b.o; clang -o app a.o b.o
would do the job, with the compiler and linker taking care of the GPU-side linking. It's still a work in progress, but when it is ready it will help to simplify the build process quite a bit and will make things like GPU-side LTO possible with clang.
If you want to try out the new driver I would appreciate it. For compiling an application in RDC mode you can do the following.
clang++ a.cu b.cu --offload-new-driver -fgpu-rdc --offload-arch=sm_70 -c
clang++ a.o b.o --offload-link -lcudart
Right now what's missing from the new driver is support for textures / surfaces, Windows / MacOS support, and compiling in non-RDC mode. The benefits are simplified compilation, static library support, and LTO among others.
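As a rough sketch only: GPU-side LTO with the new driver is expected to look like the RDC commands above with an extra flag. The -foffload-lto flag here is an assumption about current clang and is not confirmed anywhere in this thread:
clang++ a.cu b.cu --offload-new-driver -fgpu-rdc -foffload-lto --offload-arch=sm_70 -c
clang++ a.o b.o --offload-link -foffload-lto -lcudart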
It'd be good to be able to load clang from https://github.com/grailbio/bazel-toolchain so that we can have a hermetic toolchain setup. I'll probably look into this at some point soon, as we're already using that toolchain for our host builds and will be using rules_cuda soon in one of our product builds.
The LLVM apt packages of clang are also built with NVPTX enabled; we can use those too.
This is partially fixed by #143. Later I will add a full integration test by adding nccl as an example. The cloudhan/nccl-example branch should be buildable with both clang and nvcc.
Are there any flags I should add besides maybe these:
build:clang --@rules_cuda//cuda:compiler='clang'
build:clang --@rules_cuda//cuda:archs=compute_61:compute_61,sm_61
build:clang --@rules_cuda//cuda:runtime=@local_cuda//:cuda_runtime_static
build:clang --@rules_cuda//cuda:copts='-stdlib=libc++'
in theory for this to work? I'm having a weird issue: everything compiles fine, but then on execution it just dies without any output. Maybe I'm living a bit too close to the edge using clang 17 and CUDA 12.1? It does say it's only partially supported...
My whole setup is available here: https://github.com/hypdeb/lawrencium.
Running something like this should definitely be supported. The only time I've seen errors like this in the past is when there's no supported architecture; then it tends to just silently die, e.g. if I compile for sm_52 but I have an sm_70 card. For executing a basic program I would expect something like the following to work:
$ clang -x cuda cuda.cpp --offload-arch=native -L/opt/cuda/lib -lcudart
If you're using RDC mode with clang you'll need to opt in:
$ clang -x cuda cuda.cpp --offload-arch=native -L/opt/cuda/lib -lcudart --offload-new-driver -fgpu-rdc
Using native should auto-detect what card you have installed; it won't work if you're building on a different machine than the one you run on.
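If you are building on a machine without the target GPU (e.g. in CI), naming the architecture explicitly is the safer variant of the command above; sm_75 here is just an example value:
clang -x cuda cuda.cpp --offload-arch=sm_75 -L/opt/cuda/lib -lcudart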
Thanks for the extremely fast and detailed response. I just tried a few things based on your inputs, but no luck. I should add that I'm working on Ubuntu 22.04 in WSL 2, in case it's relevant. The same code was running fine a few versions ago using nvcc.
Does the tool ./bin/nvptx-arch return anything? It's a pretty good litmus test for whether we're detecting the GPU and runtime correctly at all.
@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the C/C++ compiler and clang as the CUDA compiler.
If we're doing -fno-gpu-rdc (the default) I would expect it to work, because all the CUDA-specific handling is done per-TU. But it's worth a shot.
I am already using clang as my cc compiler: https://github.com/hypdeb/lawrencium/blob/1694b0f1707d2bc6d2a782a734749ae1c1379336/toolchain/cc_toolchain_config.bzl#L24
nvptx-arch returns sm_75. I tried with the --offload-arch=native flag too, without success.
llc -version returns:
nvptx - NVIDIA PTX 32-bit
nvptx64 - NVIDIA PTX 64-bit
among many others.
Here are the exact commands run by Bazel:
/usr/local/llvm/bin/clang \
-x cu \
'--cuda-path=/usr/local/cuda-12.1' \
'-frandom-seed=bazel-out/amd64-fastbuild/bin/src/cuda/_objs/thrust_cu/thrust.o' \
-iquote . \
-iquote bazel-out/amd64-fastbuild/bin \
-iquote external/local_cuda \
-iquote bazel-out/amd64-fastbuild/bin/external/local_cuda \
-isystem external/local_cuda/cuda/include \
-isystem bazel-out/amd64-fastbuild/bin/external/local_cuda/cuda/include \
-U_FORTIFY_SOURCE \
-fstack-protector \
-Wall \
-Wthread-safety \
-Wself-assign \
-Wunused-but-set-parameter \
-Wno-free-nonheap-object \
-fcolor-diagnostics \
-fno-omit-frame-pointer \
'-stdlib=libc++' \
'--offload-arch=native' \
-c src/cuda/thrust.cu \
-o bazel-out/amd64-fastbuild/bin/src/cuda/_objs/thrust_cu/thrust.o \
-fPIC
and then
/usr/local/llvm/bin/clang -o bazel-out/amd64-fastbuild/bin/src/cuda/thrust_main \
bazel-out/amd64-fastbuild/bin/src/cuda/libthrust_cu.a \
external/local_cuda/cuda/lib64/libcudart_static.a \
external/local_cuda/cuda/lib64/libcudadevrt.a \
-ldl -lpthread -lrt \
-Wl,-S \
'-std=c++23' \
'-stdlib=libc++' \
'-fuse-ld=lld' \
-lc++ -lc++abi \
-static -lm \
-no-canonical-prefixes \
-L/usr/local/llvm/lib
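One hedged suggestion for narrowing down a binary that builds but dies without output: have the program print what the CUDA runtime actually reports. A missing or mismatched --offload-arch typically surfaces here as "no kernel image is available for execution on the device". A minimal, self-contained sketch (not taken from this thread), built with the same clang command line as above:
// check.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() {}

int main() {
  kernel<<<1, 1>>>();
  cudaError_t launch_err = cudaGetLastError();     // error from the launch itself
  cudaError_t sync_err = cudaDeviceSynchronize();  // error from kernel execution
  std::printf("launch: %s, sync: %s\n",
              cudaGetErrorString(launch_err), cudaGetErrorString(sync_err));
  return launch_err != cudaSuccess || sync_err != cudaSuccess;
}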
I think it's unlikely the issue is with rules_cuda at this point, to be honest, and I don't want to pollute this thread too much. I should probably try re-building my whole environment from scratch; it's possible I screwed something up along the way, as it was a long journey :D Thanks for the amazing rules by the way, they work great with nvcc :)
Closed with #158.