cub Refactor Thrust/CUB dispatch mechanisms to not rely on `__CUDA_ARCH_

Mar 29 '21 19:03 brycelelbach

This is relevant to my interests. I been blocked by this since 2019. What's the new mechanism?

Apr 15 '21 09:04 seanbaxter

@seanbaxter There's a work-in-progress in #276.

cub/device/dispatch/dispatch_reduce.cuh is an example of how the new mechanism is used, and the implementation is in cub/util_ptx_dispatch.cuh (For now...I'm planning to move it to cub/detail/ptx_dispatch.cuh today).

Basically:

Tag each policy class with a constexpr static int ptx_arch = XXX member.
Add all policies to a cub::detail::type_list.
Use cub::detail::ptx_dispatch to invoke a functor with the best matching policy.

I'm curious, though -- what are you doing with this? Are you using the CUB dispatch in another project?

Apr 15 '21 17:04 alliepiper

I was compiling CUDA Toolkit samples without modification with a single translation pass in 2019:

https://twitter.com/seanbax/status/1122594331720847360
https://twitter.com/seanbax/status/1125498748216512516

I ended up removing this functionality because CUDA_ARCH-based dispatch was too hard to treat robustly as the projects got more complicated. I made some requests to change this dispatch mechanism, and now it looks like it's happening. I'm looking to re-enable CUDA targets in my compiler. I want to compile CUB/Thrust with a single pass, as we all do.

For background, I had an if-codegen statement which did operand substitution at codegen time rather than template substitution, and that appears to be exactly the same as nvc++'s if-target. There was an nvvm_arch_t enum with SM versions, which is like your sm_selector. https://www.circle-lang.org/saxpy.html#how-to-dispatch

I looked over <__target_macros> closely.. Three small questions:

Are __NV_PROVIDES_SM_XX and __NV_IS_EXACTLY_SM_XX not actually macros but rather implicit bool declarations that get plugged true/false at codegen depending on the target? I'm cool if they are. If not, I have no idea what's going on.
Is nv::target::device also an implicit integer declaration holding the current SM target?
Does [[nv::__target_bitset]] implicitly declare a target_description member function along the lines of:

explicit operator bool () const noexcept { returns nv::target::device & targets; }

?

Just a bit confused by the macro name style in 1 and the C++ name style in 2. Maybe they're unrelated things.

Apr 15 '21 18:04 seanbaxter

Ah ok - Everything in cub/detail/nv/ is a temporary snapshot from NVIDIA/libcudacxx#144, this PR just updates Thrust/CUB to use the libcu++ NV_IF_TARGET macros. @wmaxey, @dkolsen-pgi, and @brycelelbach wrote and designed those macros and would be better able to answer your questions about the if-target design.

Apr 15 '21 19:04 alliepiper

nv::target::device is just a set of targets. The actual codegen-known value does not manifest in user code at any point.

Apr 15 '21 20:04 griwes

@griwes That doesn't clarify. Is nv::target::device is the bit-field of targets specified at the command line? Where is the active target manifested? It has to be somewhere.

Apr 15 '21 20:04 seanbaxter

I'm not sure how much I can say at this point, but the short answer is that that value doesn't actually ever manifest directly in a variable, and the backend understands the bitset.

Apr 15 '21 20:04 griwes

I'm just trying to compile CUB/Thrust with this new dispatch mechanic. What are the implicit declarations needed to do that?

Apr 15 '21 20:04 seanbaxter

The value of an if target expression is a bitset that encodes all the desired targets. There is compiler magic that interprets the bitset and generates the correct code. If you want Circle to use the new if target mechanism rather than __CUDA_ARCH__, the Circle compiler will have to have the same compiler magic. This is not the right place to explain all the details of how it works; we would need to set up a meeting.

Apr 15 '21 21:04 dkolsen-pgi

Are __NV_PROVIDES_SM_XX and __NV_IS_EXACTLY_SM_XX not actually macros but rather implicit bool declarations that get plugged true/false at codegen depending on the target? I'm cool if they are. If not, I have no idea what's going on.

They are used for assembling the right token when preprocessing with __CUDA_ARCH__. The whole mechanism should be uninteresting as there is nothing really different or new happening with NVCC. This is just an abstraction to allow writing compiler independent code.

In the back-end, these tokens are concatenated to obtain predefined boolean values on NVCC. It was done this way, maybe naively, because NVC++ and NVCC create the dispatch at different stages in compilation.

Apr 16 '21 00:04 wmaxey

Closing in favor of https://github.com/NVIDIA/cccl/issues/65

Apr 26 '23 20:04 jrhemstad

cub
cub copied to clipboard

Refactor Thrust/CUB dispatch mechanisms to not rely on `__CUDA_ARCH__`

cub cub copied to clipboard

Refactor Thrust/CUB dispatch mechanisms to not rely on `__CUDA_ARCH__`

cub
cub copied to clipboard