
[RFC] [SIMT] Add CUDA warp-level intrinsics to Taichi

Open yuanming-hu opened this issue 4 years ago • 25 comments

(For people who are familiar with CUDA/LLVM, this is a good starting issue: for most intrinsics, you will only need to write < 10 LoC to implement the API and < 50 LoC to test it. Come join us! :-)

Intro

There has been an increasing Taichi user need for writing high-performance SIMT kernels. For these use cases, it is fine to sacrifice a certain level of portability.

Currently, when running on CUDA, Taichi already follows the SIMT execution model. However, it lacks support for warp-level and block-level intrinsics (e.g., __ballot_sync and __syncthreads) that are often needed in advanced SIMT kernels.

Implementation plan

  • As a first step, this issue covers CUDA warp-level intrinsics only
  • In the longer term, we may consider supporting other backends such as SPIR-V, Metal, AMDGPU, etc. We may also consider other intrinsics such as __syncthreads, and add explicit shared memory support. We may even consider Tensor Core and ray-tracing intrinsics.

List of CUDA warp-level intrinsics

We plan to implement all of the following warp-level intrinsics:

  • [x] __all_sync (should be named all_nonzero in our API to avoid conflict with all in Python) (by @varinic, https://github.com/taichi-dev/taichi/pull/4718)
  • [x] __any_sync (should be named any_nonzero to avoid conflict with any in Python) (by @varinic, https://github.com/taichi-dev/taichi/pull/4719)
  • [x] __uni_sync (should be named unique) (by @0xzhang, https://github.com/taichi-dev/taichi/pull/4927#event-6570449057)
  • [x] __ballot_sync (by @Wimacs, https://github.com/taichi-dev/taichi/pull/4641)
  • [x] __shfl_sync (i32) (by @varinic https://github.com/taichi-dev/taichi/pull/4717)
  • [x] __shfl_sync (f32) (by @varinic https://github.com/taichi-dev/taichi/pull/4717)
  • [x] __shfl_up_sync (i32) (by @YuCrazing, https://github.com/taichi-dev/taichi/pull/4632)
  • [x] __shfl_up_sync (f32) (by @YuCrazing, https://github.com/taichi-dev/taichi/pull/4632)
  • [x] __shfl_down_sync (i32) (by @yuanming-hu, https://github.com/taichi-dev/taichi/pull/4616)
  • [x] __shfl_down_sync (f32) (by @caic99, https://github.com/taichi-dev/taichi/pull/4819)
  • [x] __shfl_xor_sync (by @varinic, WIP, https://github.com/taichi-dev/taichi/pull/4642)
  • [x] __match_any_sync (by @GaleSeLee, https://github.com/taichi-dev/taichi/pull/4921)
  • [x] __match_all_sync (by @GaleSeLee, https://github.com/taichi-dev/taichi/pull/4961)
  • [x] __activemask (by @GaleSeLee, https://github.com/taichi-dev/taichi/pull/4918)
  • [x] __syncwarp (by @GaleSeLee, https://github.com/taichi-dev/taichi/pull/4917 )

See here and CUDA doc for more details :-)

API

We may pick one of the following API formats, depending on whether warp-level and block-level intrinsics should be put under the same namespace:

  1. ti.simt.X, such as ti.simt.ballot() and ti.simt.warp_sync()
  2. ti.simt.warp.X, such as ti.simt.warp.ballot() and ti.simt.warp.sync()
  3. Other ideas?

Please let me know which one you guys prefer :-)

Example

Computing sum of all values in a warp using shfl_down:

@ti.func
def warp_reduce(val):
    mask = ti.u32(0xFFFFFFFF)
    # assuming warp_size = 32 and no outside warp divergence
    val += ti.simt.warp.shfl_down(mask, val, 16)
    val += ti.simt.warp.shfl_down(mask, val, 8)
    val += ti.simt.warp.shfl_down(mask, val, 4)
    val += ti.simt.warp.shfl_down(mask, val, 2)
    val += ti.simt.warp.shfl_down(mask, val, 1)
    return val
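To sanity-check the data flow of this butterfly reduction without a GPU, here is a plain-Python model (a hypothetical helper for illustration, not Taichi code): in each round, lane i adds the value held by lane i + offset, and an out-of-range shuffle returns the lane's own value, mirroring CUDA's __shfl_down_sync behavior.

```python
def simulate_warp_reduce(lanes):
    # lanes: list of 32 per-lane values; models the shfl_down butterfly above.
    assert len(lanes) == 32
    vals = list(lanes)
    offset = 16
    while offset > 0:
        # shfl_down(val, offset): lane i reads the value held by lane i + offset.
        # Lanes whose source lane is out of range keep their own value,
        # as CUDA's __shfl_down_sync does for out-of-bounds sources.
        shifted = [vals[i + offset] if i + offset < 32 else vals[i]
                   for i in range(32)]
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]  # after the final round, lane 0 holds the warp-wide sum
```

Note that only lane 0 is guaranteed to hold the full sum; the other lanes end up with partial sums, exactly as in the shuffle-based CUDA pattern.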

Steps and how we collaborate

  1. Implement the infrastructure for the intrinsics. We will use InternalFuncCallExpression and InternalFuncStmt. One issue is that in the LLVM codegen the generated function takes RuntimeContext *, which is not needed. We need to make that optional. (Update: this is done in https://github.com/taichi-dev/taichi/pull/4616)
  2. Implement all the intrinsics and add corresponding test cases
  3. Decide which namespace to use, and put all the intrinsics into that namespace. Before we reach a consensus, let's use ti.simt.warp.X.
  4. Add documentation

Currently we are at step 2. If you want to contribute, please take a single intrinsic function per PR; that simplifies review and testing.

Please leave a comment (e.g., "I'll take care of ti.simt.warp.shfl!") on this issue, so that other community members know you are working on it and we avoid duplicated work.

For example, if you wish to implement ballot, fill in

https://github.com/taichi-dev/taichi/blob/84973201e488bfcce1fa980457fe74e9141cefb3/python/taichi/lang/simt.py#L20-L22

and

https://github.com/taichi-dev/taichi/blob/84973201e488bfcce1fa980457fe74e9141cefb3/tests/python/test_simt.py#L23-L26

An example PR: https://github.com/taichi-dev/taichi/pull/4632

What we already have

Scaffold code and shfl_down_i32

I went ahead and implemented https://github.com/taichi-dev/taichi/pull/4616

LLVM -> NVVM -> PTX code path

We already have a bunch of functions that wrap most of these intrinsics: https://github.com/taichi-dev/taichi/blob/bee97d50335dd1038bd5e3de9d9385da56a0744f/taichi/llvm/llvm_context.cpp#L355-L369

Therefore, in most cases an intrinsic can be implemented in just 3-4 lines of code (plus tests) by calling these wrapper functions. For example,

https://github.com/taichi-dev/taichi/blob/22d189519ea1cbef965461618643a38784963bdd/python/taichi/lang/simt/warp.py#L81-L88

Milestone

Implement GPU parallel scan (prefix sum)? That would be very useful in particle simulations. Ideas are welcome!
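For reference, the warp-level building block of such a scan is usually a Hillis-Steele inclusive scan built on shfl_up. Below is a plain-Python sketch of the per-warp data flow (the function name is made up for illustration; a real implementation would also combine per-warp partial sums across the block):

```python
def simulate_warp_inclusive_scan(lanes):
    # Hillis-Steele inclusive scan within one warp, modelled on shfl_up:
    # in each round, lane i adds the value held by lane i - offset, if any.
    n = len(lanes)  # warp size, e.g. 32
    vals = list(lanes)
    offset = 1
    while offset < n:
        vals = [vals[i] + (vals[i - offset] if i >= offset else 0)
                for i in range(n)]
        offset *= 2
    return vals  # vals[i] == sum(lanes[: i + 1])
```

The scan takes log2(warp_size) rounds; lane i ends up with the inclusive prefix sum of lanes 0..i.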

Future steps: making Taichi (kind of) a superset of CUDA!

  1. Explicit shared memory operation support
  2. Other block-level and other intrinsics: __syncthreads, __threadfence etc.
  3. ti.raw_kernel, something that provides 1:1 mapping to a __global__ CUDA kernel

Appendix: List of higher-level primitives (available in Vulkan, Metal, etc. & implemented as helpers in CUDA)

Some of these exist in CUDA directly; however, in these higher-level APIs the scope of execution (i.e., the mask) is not involved and sync behavior is guaranteed, so they cannot be mapped 1:1 to CUDA and helper functions are needed. (Reference: https://www.youtube.com/watch?v=fP1Af0u097o where NVIDIA talks about implementing these in the drivers)

  • [x] subgroupBarrier Execution barrier
  • [x] subgroupMemoryBarrier Memory fence
  • [x] subgroupElect Elect a single invocation as leader (very useful in atomic reduction)
  • [ ] subgroupAll
  • [ ] subgroupAny
  • [ ] subgroupAllEqual
  • [x] subgroupBroadcast (might be tricky, as the id being broadcast from is a compile-time constant)
  • [ ] subgroupBroadcastFirst (uses the lowest-id active invocation)
  • [ ] Other ballot options (GL_KHR_shader_subgroup_ballot)
  • [x] Subgroup arithmetic (Useful in reduction primitives)
    • [x] subgroupAdd
    • [x] subgroupMul
    • [x] subgroupMin
    • [x] subgroupMax
    • [x] subgroupAnd
    • [x] subgroupOr
    • [x] subgroupXor
  • [x] Subgroup inclusive scan arithmetic (Like subgroup arithmetic, but the result is an inclusive scan)
  • [ ] subgroupShuffle
  • [ ] subgroupShuffleXor
  • [ ] subgroupShuffleUp
  • [ ] subgroupShuffleDown
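As a concrete example of such a helper, subgroupBroadcastFirst could be emulated on CUDA with __shfl_sync(mask, val, __ffs(mask) - 1). A plain-Python model of those semantics (hypothetical helper, for illustration only):

```python
def simulate_broadcast_first(active, data):
    # Models subgroupBroadcastFirst: every active lane receives the value
    # held by the lowest-numbered active lane. On CUDA this maps to
    # __shfl_sync(mask, val, __ffs(mask) - 1).
    # active: 32-bit mask of active lanes; data: list of 32 per-lane values.
    assert active != 0, "at least one lane must be active"
    first = (active & -active).bit_length() - 1  # index of lowest set bit
    return [data[first] if (active >> i) & 1 else None for i in range(32)]
```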

yuanming-hu avatar Mar 25 '22 15:03 yuanming-hu

Extension: Add warp size query and control. Warp-level intrinsics exist in Vulkan and Metal, and on those platforms some devices use a warp size different from 32; some devices even allow custom warp sizes. (subgroup size control & subgroup operations)

bobcao3 avatar Mar 25 '22 15:03 bobcao3

@bobcao3 Can't agree more! :-)

yuanming-hu avatar Mar 25 '22 15:03 yuanming-hu

Changes I would like to see: in addition to using CUDA's warp-level primitives, we should look into adding higher-level intrinsics directly, such as subgroup add, subgroup scan, etc. These are supported in Vulkan, and the device driver will provide an optimal implementation depending on the device. On CUDA we can provide our own wrappers for these higher-level primitives. Reference: https://www.khronos.org/blog/vulkan-subgroup-tutorial

bobcao3 avatar Mar 25 '22 15:03 bobcao3

Would love to see this! Btw Metal has pretty good warp intrinsics support as well (they call it SIMD-group). See table 6.13 in https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf

AmesingFlank avatar Mar 26 '22 00:03 AmesingFlank

One addition to this proposal: warp intrinsics are a great add-on, but in the meantime we also need a design to formalize our parallelization strategy. Right now it's quite vague to users how a Taichi for iteration is mapped to a GPU thread (TL;DR: it's backend-dependent...). I think we need to offer an explicit spec on this (cc @strongoier).

k-ye avatar Mar 26 '22 03:03 k-ye

Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks.

bobcao3 avatar Mar 26 '22 23:03 bobcao3

I want to take care of __ballot_sync intrinsics!

Wimacs avatar Mar 27 '22 06:03 Wimacs

Continuing discussions on @bobcao3's question:

Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks. We can hard-code it to all for now, but due to the complexity of the scheduling and non-guaranteed lock-step execution, using the right mask probably needs the compiler to figure out whether there can be divergence or not (when there's divergence, we need to run int mask = __match_any_sync(__activemask(), data); to get the right mask). I think handing masks over to the user may make it significantly harder to code, while also breaking compatibility with non-CUDA devices.

My opinion: I agree that exposing masks can be extra trouble for users and can harm portability. Does anyone know a frequent CUDA use case where explicitly specifying the mask is helpful? If not, then maybe we should not expose masks.

yuanming-hu avatar Mar 27 '22 11:03 yuanming-hu

I agree exposing masks can be extra trouble for users, and can harm portability.

Also vote for hiding the masks beneath Taichi's interface.

The masks are extremely troublesome and hard to understand especially in Taichi, as we have hidden a lot many parallelization details for elegant parallel programming. The prerequisite to expose mask is a set of more direct APIs to manipulate parallelization.

Does anyone know a frequent CUDA use case where explicitly specifying the masks is helpful?

Special stencil patterns covering specific near neighbors (star stencils, etc.) might need special masks, but such optimizations can be handled internally in Taichi. We can also quickly add mask APIs when needed.

turbo0628 avatar Mar 27 '22 12:03 turbo0628

According to the CUDA API, the masking behavior is really unexpected: an active thread executing an instruction while it is not in the mask yields undefined behavior, so the mask is only a convergence requirement. Now comes the tricky part: there is no explicit convergence requirement in CUDA, thus the mask must be queried every time we've taken a branch. Using the ALL mask in divergent control flow can result in a GPU hang, while using __activemask() does not guarantee reconvergence after branching. Thus we should definitely hide the mask, but it also seems quite tricky to implement masks internally. I would say we need to maintain a mask variable once we encounter an IfStmt.
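To make the mask bookkeeping concrete, here is a plain-Python model of int mask = __match_any_sync(__activemask(), data) for a 32-lane warp (hypothetical helper, illustration only, not Taichi code):

```python
def simulate_match_any(active, data):
    # Models __match_any_sync(mask, data): each active lane receives a
    # bitmask of the active lanes holding the same value as it does.
    # Inactive lanes get 0.
    # active: 32-bit mask of active lanes; data: list of 32 per-lane values.
    result = [0] * 32
    for i in range(32):
        if not (active >> i) & 1:
            continue  # lane i is inactive
        for j in range(32):
            if (active >> j) & 1 and data[j] == data[i]:
                result[i] |= 1 << j
    return result
```

Every lane in such a per-value group then shares an identical mask, which is what makes the resulting mask safe to pass to a subsequent _sync intrinsic.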

bobcao3 avatar Mar 27 '22 16:03 bobcao3

Masks in vector processing, as in AVX512 or the RISC-V Vector extension, are very different from CUDA's.

bobcao3 avatar Mar 27 '22 16:03 bobcao3

I would like to take care of the __shfl_xor_sync intrinsic!

varinic avatar Mar 27 '22 17:03 varinic

A naive question: if I want to implement a task in this issue or another open issue, how do I know whether somebody is already doing the same work?

DongqiShen avatar Mar 28 '22 04:03 DongqiShen

A naive question: if I want to implement a task in this issue or another open issue, how do I know whether somebody is already doing the same work?

Good question. As long as nobody says "I'll take this task" and the issue has no assignee, you are safe to assume that nobody is working on it. Before you start coding, it would be nice to leave a comment "let me implement XXXX" so that people know you are working on it :-)

yuanming-hu avatar Mar 28 '22 05:03 yuanming-hu

Changes i would like to see: in addition to using CUDA's warp level primitives, we should look into adding higher level intrinsics directly such as subgroup add, subgroup scan, etc. These are supported in Vulkan and device driver will provide optimal implementation depending on the device. On CUDA we can provide our own wrapper for these higher level primitives.

Recent NVIDIA GPUs (Ampere and later) also support reduce_sync variants of the intrinsics: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-reduce-functions In particular, slide 47 of this deck says the __reduce_op_sync warp intrinsics are faster than a warp-shuffle-based implementation by 10x: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21170-cuda-on-nvidia-ampere-gpu-architecture-taking-your-algorithms-to-the-next-level-of-performance.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczpcL1wvd3d3Lmdvb2dsZS5jb21cLyIsIm5jaWQiOiJlbS1hbm5vLTkyMTMzOS12dDIwIn0

masahi avatar Apr 25 '22 04:04 masahi

Recent NVIDIA GPUs (Ampere and later) also support reduce_sync variants of the intrinsics: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-reduce-functions In particular, slide 47 of this deck says the __reduce_op_sync warp intrinsics are faster than a warp-shuffle-based implementation by 10x: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21170-cuda-on-nvidia-ampere-gpu-architecture-taking-your-algorithms-to-the-next-level-of-performance.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczpcL1wvd3d3Lmdvb2dsZS5jb21cLyIsIm5jaWQiOiJlbS1hbm5vLTkyMTMzOS12dDIwIn0

Wow, that sounds quite attractive. Thanks for pointing this out. We need to dispatch the code according to compute capability. One place to look at: https://github.com/taichi-dev/taichi/blob/d82ea9045792a1f14a04a03c0b9292bb4c7235c5/taichi/runtime/llvm/locked_task.h#L28

@qiao-bo Could you add this to the feature list and coordinate its development? Many thanks!

yuanming-hu avatar Apr 25 '22 05:04 yuanming-hu

@yuanming-hu @masahi It turns out to be a bit difficult to support the new warp reduce intrinsics at this moment. For example, __reduce_add_sync (i32) needs to be mapped to redux.sync.add.s32. This new redux keyword is only supported since LLVM 13 (https://github.com/llvm/llvm-project/blob/release/13.x/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td). I also tried bypassing NVVM and just using PTX asm in our runtime, but LLVM 10 wouldn't let us because of the PTX JIT compilation.

The migration to LLVM 12 is on our roadmap. Nevertheless, it may still lack support for this warp reduce ;). For the purpose of this issue, I suggest moving this feature proposal to another issue for later work. WDYT?

qiao-bo avatar Apr 27 '22 14:04 qiao-bo

Sounds good - we probably need to postpone the implementation until we have LLVM >= 13.

(If someone insists on implementing it now, they can also consider using inline PTX assembly.)

yuanming-hu avatar Apr 27 '22 15:04 yuanming-hu

I will take care of __syncwarp intrinsic.

galeselee avatar May 05 '22 07:05 galeselee

I'll take care of __uni_sync.

0xzhang avatar May 06 '22 23:05 0xzhang

I will take care of __syncwarp intrinsic.

I'm working on match_all.

galeselee avatar May 07 '22 03:05 galeselee

Update: Since we are approaching v1.1.0 release, I would like to draw an intermediate summary on this issue.

Thanks to our contributors, the list of warp-level intrinsics has been fully implemented. The milestone has also been achieved, namely using the intrinsics to implement a parallel scan (https://github.com/taichi-dev/taichi_benchmark/blob/main/pbf/src/taichi/scan.py), thanks to @YuCrazing.

As the next step, the following related tasks are planned:

  • Add more examples to utilize the warp intrinsics
  • Document the intrinsics on the Taichi docs website
  • Block-level support, i.e., explicit shared memory support @turbo0628
  • raw_kernel support
  • HW supported warp intrinsics on NV GPUs

In the long term, we plan to provide high-level primitives that are backend-agnostic and can abstract over CUDA warp intrinsics, Vulkan subgroups, Metal SIMD-groups, CPU vectorization, etc.

Since this issue is meant to address CUDA warp-level intrinsics, maybe we can use another issue to track the progress of the mentioned tasks?

qiao-bo avatar Jul 05 '22 04:07 qiao-bo

Hi, I wanted to know if anyone is working on adding support for the subgroup* operations listed above? I can add support for some of the simple ones (shuffle*) but it'd be great if someone can look into the ballot ops (supported by GL_KHR_shader_subgroup_ballot) as I'm not sure how to implement them (the return type is a uvec4) and need to use them for a project.

alasin avatar Jan 02 '24 15:01 alasin

Hi, I wanted to know if anyone is working on adding support for the subgroup* operations listed above? I can add support for some of the simple ones (shuffle*) but it'd be great if someone can look into the ballot ops (supported by GL_KHR_shader_subgroup_ballot) as I'm not sure how to implement them (the return type is a uvec4) and need to use them for a project.

Maybe you can use a structure similar to how TextureStmt returns vec4...

bobcao3 avatar Jan 02 '24 19:01 bobcao3

Maybe you can use a structure similar to how TextureStmt returns vec4...

Can you share the link to it? I can't find TextureStmt while searching.

alasin avatar Jan 03 '24 15:01 alasin