
Support clang as a CUDA compiler

Artem-B opened this issue 2 years ago • 16 comments

Clang is capable of CUDA compilation these days.

It would be great to add support for using it for CUDA compilation with bazel.

Artem-B avatar May 19 '22 15:05 Artem-B

Yeah, there is a plan. I have a local modification, but it hasn't made it in yet.

cloudhan avatar May 20 '22 02:05 cloudhan

@Artem-B Considering you are one of the devs of clang CUDA, I have a question about it. I now have a nearly working configuration for clang. The only problem I am facing is the following. For example, given:

__global__ void kernel() {
  // blahblah with impl
}

with nvcc, the compiled object has the symbols

000000000000019a T __device_stub__Z6kernelv()
00000000000002aa T kernel()

but with clang version 15.0.0 (https://github.com/llvm/llvm-project.git 009d56da5c4ea3666c4753ce7564c8c20d7e0255)

0000000000000000 T __device_stub__kernel()

kernel() is missing from the object, which causes linker errors for various __global__ functions in my nccl example. In that example, a libnccl.so is produced first and then linked into various perf-test binaries. I am curious what the root problem is here.

cloudhan avatar May 30 '22 16:05 cloudhan

nvcc and clang do have somewhat different behavior under the hood and you're likely dealing with more than one issue here.

First, kernels and stubs. Clang indeed no longer emits the kernel symbol on the host side: https://cuda.godbolt.org/z/833ejYs1c The main reason for that was to make it possible to distinguish the host-side stub from the actual kernel function in the debugger on AMD GPUs. It does mean that CUDA objects compiled with clang will not be able to use <<<...>>> to launch kernels from objects compiled with nvcc. This is probably not a big deal, as mixing two different compilers in the same build is a bad idea to start with. I do not think that's the root cause of the problem you see.
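
For illustration, a minimal sketch of why that host-side symbol matters (the file name is hypothetical):

// launch.cu -- minimal sketch. The <<<...>>> syntax below is lowered by the
// compiler into a call to host-side launch machinery: nvcc also emits a host
// function named kernel() (the symbol seen in the nm output above), while
// clang emits only __device_stub__kernel, which is why objects produced by
// the two compilers cannot cross-launch each other's kernels.
#include <cstdio>

__global__ void kernel() {
  printf("hello from the device\n");
}

int main() {
  kernel<<<1, 1>>>();       // resolves against the compiler's own stub symbol
  cudaDeviceSynchronize();  // wait for the kernel and flush device-side printf
  return 0;
}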

NCCL does rely on RDC compilation (i.e. each source compiles to a GPU object file, instead of a fully linked GPU executable), and that part works very differently in clang vs nvcc. In a nutshell, the object files need an extra final device-linking step and a bit of extra host-side 'glue' code. NVCC does that under the hood. Clang does not, yet.
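
A minimal sketch of what RDC mode means in practice (file names and the build commands in the comments are illustrative, not nccl's actual setup):

// a.cu -- calls a __device__ function defined in another translation unit.
// This only links in RDC mode (nvcc -rdc=true, clang -fgpu-rdc), and the
// resulting objects still need a final device-link step, e.g. with nvcc:
//   nvcc -rdc=true -c a.cu b.cu
//   nvcc -dlink a.o b.o -o dlink.o   // the extra step nvcc-driven builds hide
extern __device__ int valueFromB();

__global__ void kernel(int* out) {
  *out = valueFromB();
}

// b.cu would contain the definition:
//   __device__ int valueFromB() { return 42; }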

Here's how tensorflow implements cuda_rdc_library it uses to compile nccl: https://github.com/tensorflow/tensorflow/blob/ee7cf722e4ca9d02b6e62eb3d1b7506ead995422/third_party/nccl/build_defs.bzl.tpl#L261

@jhuber6 has been working on clang driver changes to make compile-to-object-and-link "just work" on the GPU side. E.g. https://github.com/llvm/llvm-project/commit/b7c8c4d8cf07d2e9e8cd157bccc8bd9e7c76415a It's an experimental feature at the moment, but it will make things much simpler, once it's ready.

Artem-B avatar May 30 '22 18:05 Artem-B

Also, I believe CMake has recently added support for clang as the CUDA compiler. It may be worth checking whether/how they handle RDC compilation there.

Artem-B avatar May 30 '22 18:05 Artem-B

I think the expected way to perform RDC-mode compilation is via the CUDA_SEPARABLE_COMPILATION option, which appears to be supported for Clang as well, judging by this issue.

jhuber6 avatar May 30 '22 19:05 jhuber6

Judging by the commit that implemented it in CMake, they used the same RDC compilation process that we implemented in TensorFlow, which I pointed to above.

You may as well just pick up tensorflow's implementation directly. @chsigg would probably be the most familiar with the details, if you have questions.

Artem-B avatar May 30 '22 21:05 Artem-B

OK, problem solved. It turns out that I need to compile all C code of nccl as CUDA with -x cu; otherwise, there will be linker errors caused by __global__ functions. Also mentioned here.
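
A hypothetical sketch of the failure mode (the file and kernel names are made up): nccl keeps kernel code in files without a .cu extension, and without -x cu the compiler does not treat such a file as CUDA, so no device code or launch stubs are generated and the missing __global__ symbols only show up at link time.

// all_reduce.cc -- hypothetical sketch: a kernel defined in a non-.cu file.
// Force CUDA compilation explicitly, e.g.:
//   clang -x cu all_reduce.cc -c -o all_reduce.o ...
// Compiled as plain C++ instead, this file produces no device code, and any
// binary that launches the kernel fails to link.
__global__ void allReduceKernel(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] *= 2.0f;  // placeholder body
}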

cloudhan avatar May 31 '22 13:05 cloudhan

@Artem-B I think this is addressed in c13ebaada58956435658b18bb91fb46a23534995

Use --@rules_cuda//cuda:compiler=clang to select the compiler. See the detect_clang macro for auto-detection; otherwise, you'll need to bring the toolchain config and registration yourself.

As you are a member of the TensorFlow team, I am wondering if this can be mentioned or evaluated in TF. It might be good for the bazel community ;)

cloudhan avatar Jun 10 '22 07:06 cloudhan

@Artem-B I think this is addressed in https://github.com/cloudhan/rules_cuda/commit/c13ebaada58956435658b18bb91fb46a23534995

Do you mean that your build rules are ready to use?

I guess the right sequence to make it all work is to get these rules upstreamed into bazel, then port TF build to it. I'll see what I can do to bring it to the attention of the right people.

Artem-B avatar Jun 10 '22 14:06 Artem-B

Not necessarily production-ready, but at least usable. It needs more users to test it out before I can say it is production-ready. Because it is a build system, there are too many corner cases.

cloudhan avatar Jun 10 '22 14:06 cloudhan

One thing that could serve as motivation to adopt these changes would be to try getting TensorFlow to build using your rules, instead of the ones TF carries. It would be a pretty decent test of the real-world usability of the rules -- TF is probably the largest bazel user outside of Google and is almost certainly the largest user of clang for CUDA compilation. Having them convinced would go a long way towards convincing bazel owners that these rules should be part of bazel.

Having a proof of concept at that level would also give TF owners a rough idea of how much work adoption would take and whether it's worth it. One thing to keep in mind is that TF also has to work with our internal build. I don't know yet how hard it would be to switch to your rules. If it's a drop-in replacement for the cuda_library() implementation, it should be doable. NCCL and other RDC compilation users would need some more work, but that should be manageable, too.

Artem-B avatar Jun 10 '22 16:06 Artem-B

@Artem-B Do we have a prebuilt llvm package with the NVPTX backend enabled? I'd like to add a build CI, so that I can finally close this issue with confidence.

cloudhan avatar Jul 12 '22 15:07 cloudhan

LLVM/Clang releases should have NVPTX built in, e.g. https://github.com/llvm/llvm-project/releases/tag/llvmorg-14.0.6

On a side note, just an FYI that there have been a lot of offloading-related changes in the clang driver lately that are going to make GPU compilation much closer to C++ compilation. E.g. RDC compilation would "just work" -- clang -c a.cu -o a.o; clang -c b.cu -o b.o; clang -o app a.o b.o would do the job, with the compiler and linker taking care of the GPU-side linking. It's still a work in progress, but once it's ready it will simplify the build process quite a bit and will make things like GPU-side LTO possible with clang.

Artem-B avatar Jul 12 '22 16:07 Artem-B

If you want to try out the new driver, I would appreciate it. For compiling an application in RDC mode, you can do the following:

clang++ a.cu b.cu --offload-new-driver -fgpu-rdc --offload-arch=sm_70 -c
clang++ a.o b.o --offload-link -lcudart

Right now the new driver is missing support for textures/surfaces, Windows/macOS, and compiling in non-RDC mode. The benefits are simplified compilation, static library support, and LTO, among others.
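
For concreteness, a possible pair of sources for the two commands above (the contents are illustrative; any split works as long as device references cross the translation-unit boundary):

// a.cu -- launches a kernel that is defined in b.cu; the cross-TU reference
// is what the --offload-link step resolves.
extern __global__ void kernelInB();

int main() {
  kernelInB<<<1, 1>>>();
  return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

// b.cu would contain the definition:
//   __global__ void kernelInB() {}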

jhuber6 avatar Jul 12 '22 20:07 jhuber6

It'd be good to be able to load clang from https://github.com/grailbio/bazel-toolchain so that we can have a hermetic toolchain setup. I'll probably look into this at some point soon as we're already using that toolchain for our host builds and will be using rules_cuda soon within one of our product builds.

jsharpe avatar Nov 03 '22 23:11 jsharpe

The llvm apt packages of clang are also built with NVPTX enabled; we can use those too.

cloudhan avatar Nov 05 '22 01:11 cloudhan

This is partially fixed by #143. Later I will add a full integration test using nccl as an example. The cloudhan/nccl-example branch should be buildable with both clang and nvcc.

cloudhan avatar Aug 09 '23 15:08 cloudhan

Are there any flags I should add besides maybe these:

build:clang --@rules_cuda//cuda:compiler='clang'
build:clang --@rules_cuda//cuda:archs=compute_61:compute_61,sm_61
build:clang --@rules_cuda//cuda:runtime=@local_cuda//:cuda_runtime_static
build:clang --@rules_cuda//cuda:copts='-stdlib=libc++'

in theory, for this to work? I'm having a weird issue: everything compiles fine, but then on execution it just dies without any output. Maybe I'm living a bit too close to the edge using clang 17 and CUDA 12.1? It does say it's only partially supported...

My whole setup is available here: https://github.com/hypdeb/lawrencium.

hypdeb avatar Aug 10 '23 16:08 hypdeb

Are there any flags I should add besides maybe these:

build:clang --@rules_cuda//cuda:compiler='clang'
build:clang --@rules_cuda//cuda:archs=compute_61:compute_61,sm_61
build:clang --@rules_cuda//cuda:runtime=@local_cuda//:cuda_runtime_static
build:clang --@rules_cuda//cuda:copts='-stdlib=libc++'

in theory, for this to work? I'm having a weird issue: everything compiles fine, but then on execution it just dies without any output. Maybe I'm living a bit too close to the edge using clang 17 and CUDA 12.1? It does say it's only partially supported...

At least running something like this should definitely be supported. The only time I've seen errors like this in the past is when there's no supported architecture; it tends to just silently die. E.g. if I compile for sm_52 but I have an sm_70 card. For executing a basic program I would expect something like the following to work:

$ clang -x cuda cuda.cpp --offload-arch=native -L/opt/cuda/lib -lcudart

If you're using RDC mode with clang, you'll need to opt in:

$ clang -x cuda cuda.cpp --offload-arch=native -L/opt/cuda/lib -lcudart --offload-new-driver -fgpu-rdc

Using native should auto-detect what card you have installed; it won't work if you're building on a different machine from the one you run on.
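
Since a bad-architecture launch fails asynchronously rather than printing anything, a minimal probe like the one below (a sketch, not part of rules_cuda) turns the silent death into a readable error:

// probe.cu -- sketch for diagnosing the "compiles fine, dies silently" case.
// A launch compiled for an unsupported architecture fails without output;
// checking the error state after the launch surfaces the real cause.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void probe() {}

int main() {
  probe<<<1, 1>>>();
  cudaError_t err = cudaGetLastError();   // catches bad launches immediately
  if (err != cudaSuccess) {
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  err = cudaDeviceSynchronize();          // catches errors during execution
  if (err != cudaSuccess) {
    fprintf(stderr, "exec failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  puts("ok");
  return 0;
}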

jhuber6 avatar Aug 10 '23 16:08 jhuber6

Thanks for the extremely fast and detailed response. I just tried a few things based on your inputs, but no luck. I should add that I'm working on Ubuntu 22.04 under WSL 2, if it's relevant. The same code was running fine a few versions ago using nvcc.

hypdeb avatar Aug 10 '23 16:08 hypdeb

Does the tool ./bin/nvptx-arch return anything? It's a pretty good litmus test for whether we're detecting the GPU and runtime correctly at all.

jhuber6 avatar Aug 10 '23 16:08 jhuber6

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the cuda compiler.

cloudhan avatar Aug 10 '23 17:08 cloudhan

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the cuda compiler.

If we're doing -fno-gpu-rdc (the default) I would expect it to work, because all the CUDA-specific handling is done per-TU. But it's worth a shot.

jhuber6 avatar Aug 10 '23 17:08 jhuber6

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the cuda compiler.

I am already using clang as my cc compiler: https://github.com/hypdeb/lawrencium/blob/1694b0f1707d2bc6d2a782a734749ae1c1379336/toolchain/cc_toolchain_config.bzl#L24

Does the tool ./bin/nvptx-arch return anything? It's a pretty good litmus test for whether we're detecting the GPU and runtime correctly at all.

nvptx-arch returns sm_75. I tried with the --offload-arch=native flag too, without success.

llc -version returns:

    nvptx       - NVIDIA PTX 32-bit
    nvptx64     - NVIDIA PTX 64-bit

among many others.

hypdeb avatar Aug 10 '23 17:08 hypdeb

Here are the exact commands run by Bazel:

/usr/local/llvm/bin/clang \
    -x cu \
    '--cuda-path=/usr/local/cuda-12.1' \
    '-frandom-seed=bazel-out/amd64-fastbuild/bin/src/cuda/_objs/thrust_cu/thrust.o' \
    -iquote . \
    -iquote bazel-out/amd64-fastbuild/bin \
    -iquote external/local_cuda \
    -iquote bazel-out/amd64-fastbuild/bin/external/local_cuda \
    -isystem external/local_cuda/cuda/include \
    -isystem bazel-out/amd64-fastbuild/bin/external/local_cuda/cuda/include \
    -U_FORTIFY_SOURCE \
    -fstack-protector \
    -Wall \
    -Wthread-safety \
    -Wself-assign \
    -Wunused-but-set-parameter \
    -Wno-free-nonheap-object \
    -fcolor-diagnostics \
    -fno-omit-frame-pointer \
    '-stdlib=libc++' \
    '--offload-arch=native' \
    -c src/cuda/thrust.cu \
    -o bazel-out/amd64-fastbuild/bin/src/cuda/_objs/thrust_cu/thrust.o \
    -fPIC

and then

/usr/local/llvm/bin/clang -o bazel-out/amd64-fastbuild/bin/src/cuda/thrust_main \
    bazel-out/amd64-fastbuild/bin/src/cuda/libthrust_cu.a \
    external/local_cuda/cuda/lib64/libcudart_static.a \
    external/local_cuda/cuda/lib64/libcudadevrt.a \
    -ldl -lpthread -lrt \
    -Wl,-S \
    '-std=c++23' \
    '-stdlib=libc++' \
    '-fuse-ld=lld' \
    -lc++ -lc++abi \
    -static -lm \
    -no-canonical-prefixes \
    -L/usr/local/llvm/lib

hypdeb avatar Aug 10 '23 17:08 hypdeb

I think it's unlikely the issue is with rules_cuda at this point, to be honest, and I don't want to pollute this thread too much. I should probably try re-building my whole environment from scratch; it's possible I screwed something up along the way, as it was a long journey :D Thanks for the amazing rules by the way, they work great with nvcc :)

hypdeb avatar Aug 10 '23 17:08 hypdeb

Closed with #158.

cloudhan avatar Sep 04 '23 17:09 cloudhan