cub
cub copied to clipboard
Refactor Thrust/CUB dispatch mechanisms to not rely on `__CUDA_ARCH__`
This is relevant to my interests. I been blocked by this since 2019. What's the new mechanism?
@seanbaxter There's a work-in-progress in #276.
cub/device/dispatch/dispatch_reduce.cuh
is an example of how the new mechanism is used, and the implementation is in cub/util_ptx_dispatch.cuh
(For now...I'm planning to move it to cub/detail/ptx_dispatch.cuh
today).
Basically:
- Tag each policy class with a
constexpr static int ptx_arch = XXX
member. - Add all policies to a
cub::detail::type_list
. - Use
cub::detail::ptx_dispatch
to invoke a functor with the best matching policy.
I'm curious, though -- what are you doing with this? Are you using the CUB dispatch in another project?
I was compiling CUDA Toolkit samples without modification with a single translation pass in 2019:
- https://twitter.com/seanbax/status/1122594331720847360
- https://twitter.com/seanbax/status/1125498748216512516
I ended up removing this functionality because CUDA_ARCH-based dispatch was too hard to treat robustly as the projects got more complicated. I made some requests to change this dispatch mechanism, and now it looks like it's happening. I'm looking to re-enable CUDA targets in my compiler. I want to compile CUB/Thrust with a single pass, as we all do.
For background, I had an if-codegen statement which did operand substitution at codegen time rather than template substitution, and that appears to be exactly the same as nvc++'s if-target. There was an nvvm_arch_t
enum with SM versions, which is like your sm_selector
.
https://www.circle-lang.org/saxpy.html#how-to-dispatch
I looked over <__target_macros> closely.. Three small questions:
- Are __NV_PROVIDES_SM_XX and __NV_IS_EXACTLY_SM_XX not actually macros but rather implicit bool declarations that get plugged true/false at codegen depending on the target? I'm cool if they are. If not, I have no idea what's going on.
- Is nv::target::device also an implicit integer declaration holding the current SM target?
- Does [[nv::__target_bitset]] implicitly declare a
target_description
member function along the lines of:
explicit operator bool () const noexcept { returns nv::target::device & targets; }
?
Just a bit confused by the macro name style in 1 and the C++ name style in 2. Maybe they're unrelated things.
Ah ok - Everything in cub/detail/nv/
is a temporary snapshot from NVIDIA/libcudacxx#144, this PR just updates Thrust/CUB to use the libcu++ NV_IF_TARGET
macros. @wmaxey, @dkolsen-pgi, and @brycelelbach wrote and designed those macros and would be better able to answer your questions about the if-target design.
nv::target::device
is just a set of targets. The actual codegen-known value does not manifest in user code at any point.
@griwes That doesn't clarify. Is nv::target::device
is the bit-field of targets specified at the command line? Where is the active target manifested? It has to be somewhere.
I'm not sure how much I can say at this point, but the short answer is that that value doesn't actually ever manifest directly in a variable, and the backend understands the bitset.
I'm just trying to compile CUB/Thrust with this new dispatch mechanic. What are the implicit declarations needed to do that?
The value of an if target
expression is a bitset that encodes all the desired targets. There is compiler magic that interprets the bitset and generates the correct code. If you want Circle to use the new if target
mechanism rather than __CUDA_ARCH__
, the Circle compiler will have to have the same compiler magic. This is not the right place to explain all the details of how it works; we would need to set up a meeting.
- Are __NV_PROVIDES_SM_XX and __NV_IS_EXACTLY_SM_XX not actually macros but rather implicit bool declarations that get plugged true/false at codegen depending on the target? I'm cool if they are. If not, I have no idea what's going on.
They are used for assembling the right token when preprocessing with __CUDA_ARCH__
. The whole mechanism should be uninteresting as there is nothing really different or new happening with NVCC. This is just an abstraction to allow writing compiler independent code.
In the back-end, these tokens are concatenated to obtain predefined boolean values on NVCC. It was done this way, maybe naively, because NVC++ and NVCC create the dispatch at different stages in compilation.
Closing in favor of https://github.com/NVIDIA/cccl/issues/65