
exposing warp-level semantics

leios opened this issue 2 years ago • 12 comments

I had a request from a user to use warp-level semantics from CUDA: sync_warp, warpsize, and stuff here: https://cuda.juliagpu.org/stable/api/kernel/#Warp-level-functions.
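For concreteness, here is a minimal CUDA.jl-only sketch (mine, not the user's code) of the kind of kernel this is about: a warp-level sum built from warpsize, shfl_down_sync, and sync_warp from the page linked above.

```julia
using CUDA

function warp_sum!(out, x)
    val = x[threadIdx().x]
    # fold the warp in half repeatedly; shfl_down_sync reads `val` from the
    # lane `offset` positions higher within the same warp
    offset = CUDA.warpsize() ÷ 2
    while offset > 0
        val += CUDA.shfl_down_sync(0xffffffff, val, offset)
        offset ÷= 2
    end
    CUDA.sync_warp()            # re-converge the warp before using the result
    if threadIdx().x == 1       # the first lane now holds the warp-wide sum
        out[1] = val
    end
    return nothing
end

x   = CUDA.rand(Float32, 32)
out = CUDA.zeros(Float32, 1)
@cuda threads=32 warp_sum!(out, x)
Array(out)[1] ≈ sum(Array(x))
```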

They seem to be available here: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html, but I don't know where they exist in AMDGPU.jl or how to use them in KA.

They might be available, but I couldn't find "warp" or "wavefront" or anything else in either the AMDGPU or KernelAbstractions docs. I mean, there was this page: https://amdgpu.juliagpu.org/stable/wavefront_ops/ ... but it's a bit sparse ^^

If this is already available in KA, I'm happy to add a bit to the docs explaining how they are used. If it is not available, I guess I need to put some PRs forward for CUDA(kernels), ROC(kernels), and here with the new syntax.

Related discussion: https://github.com/JuliaMolSim/Molly.jl/pull/147

Putting it here because I think I found kinda what I was looking for for AMDGPU here: https://github.com/JuliaGPU/AMDGPU.jl/blob/master/test/device/wavefront.jl

  • wavefrontsize = warpsize
  • wfred = wavefront reduce
  • wfscan = wavefront scan
  • wfany = ???
  • wfall = ???
  • wfsame = ???
  • ??? = warp_sync

leios avatar Sep 08 '23 14:09 leios

There currently is no support in KA for wavefront/warp level programming.

Two immediate questions:

  1. What would the semantics be on the CPU level?
  2. Do Intel and Metal also have such primitives?

If the goal is to expose warp level reduce operations, maybe we can get away with defining a workgroup @reduce and leave it up to the backends to implement that reduction efficiently?
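To make that concrete, here is a rough sketch of a hand-written workgroup sum using only current KA primitives; a hypothetical workgroup @reduce would hide this loop and let each backend drop in its efficient native reduction. It assumes a statically known workgroup size so @localmem can be sized from @groupsize().

```julia
using KernelAbstractions

@kernel function groupsum!(out, @Const(x))
    gi = @index(Global, Linear)
    li = @index(Local, Linear)
    N  = @uniform prod(@groupsize())      # ok because the groupsize is static
    tmp = @localmem eltype(x) (N,)        # scratch shared by the workgroup
    tmp[li] = x[gi]
    @synchronize
    if li == 1                            # naive serial combine by one work-item
        acc = zero(eltype(x))
        for k in 1:N
            acc += tmp[k]
        end
        out[@index(Group, Linear)] = acc
    end
end

# usage: one partial sum per workgroup of 64
# groupsum!(backend, 64)(out, x; ndrange = length(x))
```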

x-ref: https://github.com/JuliaGPU/KernelAbstractions.jl/pull/419

vchuravy avatar Sep 08 '23 17:09 vchuravy

I'm struggling to find much at all on warp-level semantics for Metal or even oneAPI.

It seems like OpenCL just ignores it(?): https://stackoverflow.com/questions/42259118/is-there-any-guarantee-that-all-of-threads-in-wavefront-opencl-always-synchron

To be honest, I haven't seen an application that really needs sync_warp, and warpsize is often just inferred from the host and passed in.
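"Inferred from the host and passed in" typically looks something like this (a sketch; the kernel and names are made up):

```julia
using CUDA, KernelAbstractions

@kernel function uses_warpsize!(out, ws)
    i = @index(Global)
    out[i] = ws                         # the kernel just consumes the host-side value
end

ws  = CUDA.warpsize(CUDA.device())      # 32 on NVIDIA; AMD is typically 64
out = CUDA.zeros(Int32, 128)
uses_warpsize!(CUDABackend(), 64)(out, Int32(ws); ndrange = length(out))
```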

Here's a question I don't have an answer to: do other (non-NVIDIA) cards even need sync_warp? I think this comes from the fact that the Volta architecture deals with warp divergence differently: https://forums.developer.nvidia.com/t/why-syncwarp-is-necessary-in-undivergent-warp-reduction/209893

Do other architectures (Intel, AMD, Metal) even allow this? I guess they might in the future if they don't already.

That means that, for the short term, we would need CUDA-specific tooling where sync_warp() does what it's supposed to do for CUDA but does nothing for other architectures.
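Something like the following is what I have in mind (purely a sketch; subgroup_sync and the dispatch arrangement are hypothetical, not an existing KA or CUDAKernels API): a no-op fallback that the CUDA backend overrides.

```julia
using KernelAbstractions
import KernelAbstractions: Backend

# generic fallback: backends with no sync scope narrower than the
# workgroup simply do nothing here
@inline subgroup_sync(::Backend) = nothing

# the CUDA backend (e.g. in CUDAKernels or a package extension) would add
#     @inline subgroup_sync(::CUDABackend) = CUDA.sync_warp()
```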

leios avatar Sep 09 '23 07:09 leios

Coming back to my question: What's the reason you want to access this functionality?

Generally speaking I don't think warpsize is something we should expose in KA, but there are of course workgroup operations we are missing. Reduction is the core one.

#421 is introducing the notion of a subgroup, but I want to understand the reasoning behind that better.

Exposing functionality for one backend only has the risk that the user writes a kernel that is actually not portable.

vchuravy avatar Sep 09 '23 15:09 vchuravy

Reading through pages such as https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/sub-groups-and-simd-vectorization.html and https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html, I was under the impression that subgroups do map pretty closely to warps/wavefronts? If that's the case, then having a cross-platform abstraction for working with them seems useful.

ToucheSir avatar Sep 09 '23 16:09 ToucheSir

KernelAbstractions is not oneAPI, so the meaning of subgroup needs to be defined clearly and independently.

It often comes down to: can we expose these semantics without too much of a performance loss on other hardware? Users are always free to use CUDA.jl directly, but writing a KA kernel should come with a reasonable expectation of performance across all backends.

KernelAbstractions is a common denominator, not a superset of behavior.

vchuravy avatar Sep 09 '23 17:09 vchuravy

I would guess the outlier here is Metal (and parallel CPU) then? I think AMD (wavefronts), CUDA (warps), and Intel (subgroups) all have some concept of warp-level operations; however, I agree with @vchuravy here. None of the warp-level semantics seem standardized enough to put them into KA at this time.

What is the plan with #421, though? I mean, if it's already introducing a subgroup, I guess we can use that for the other backends?

On my end, I was trying to do a simple port of: https://github.com/JuliaMolSim/Molly.jl/pull/133 so we could completely remove the CUDA backend.

leios avatar Sep 09 '23 18:09 leios

For the CPU I had long hoped to use SIMD.jl or a compiler pass to perform vectorization.

Would a subgroupsize of 1 be legal?

vchuravy avatar Sep 09 '23 19:09 vchuravy

KernelAbstractions is not oneAPI, so the meaning of subgroup needs to be defined clearly and independently.

Yes, which is why I found the second link interesting. Digging around a bit more turned up some pages from the SYCL spec (1, 2, 3) which appears to be trying to standardize this. I have no idea how integration on the AMD and Nvidia side works (if at all), but perhaps it could serve as inspiration for creating a common denominator interface in KA.

ToucheSir avatar Sep 09 '23 21:09 ToucheSir

It looks like Vulkan is also trying to standardize the terminology: https://www.khronos.org/blog/vulkan-subgroup-tutorial. Their API is supposed to be similar to OpenCL for compute, but I cannot find such topics in OpenCL.

For me, I can obviously see a use for warp reduce, scan, etc. I also find myself wanting to get warp_size a bunch because (in general) AMD has a warpsize of 64 while NVIDIA has a warpsize of 32. For the CPU, a warpsize of 1 sounds correct, right?

It's just that sync_warp would come with caveats:

  1. It probably doesn't work on a Mac
  2. It probably doesn't do anything on parallel CPU
  3. It is probably useless pre-Volta

But I don't think it will produce any wrong results on any of these platforms if there are dummy calls. I mean, the sync_warp behaviour was (is) the default for most GPUs, and reduce, scan, etc. are all functions that have existed at that level for a while, so... I guess it is fine?

leios avatar Sep 10 '23 05:09 leios

Their API is supposed to be similar to OpenCL for compute, but I cannot find such topics in OpenCL.

It's kind of hidden away and took a while for me to find, but the OpenCL spec does touch on sub-groups (they like the hyphen) in a few places. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_mapping_work_items_onto_an_ndrange introduces them and subsequent sections like https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#execution-model-sync look relevant. There's also some more info about the actual kernel-level API in the OpenCL C spec, e.g. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#subgroup-functions.

ToucheSir avatar Sep 10 '23 16:09 ToucheSir

I think having some "subgroup sync" op would be helpful (it could fall back on a full sync where subgroups aren't supported).

simonbyrne avatar Sep 21 '23 20:09 simonbyrne

I tend to think of "warps on a CPU" as the SIMD vector size. The semantics are quite similar:

  • SIMD instructions execute all or nothing, with individual SIMD lanes possibly disabled (e.g. via masking on AVX-512)
  • all SIMD lanes always execute in sync
  • there are certain special instructions for SIMD-wide reductions (e.g. "horizontal add")

Thus, when using SIMD instructions on a CPU, it can be useful to know the SIMD vector size used in a kernel.

It might also make sense to let the caller specify (statically) the SIMD vector size for a kernel and pass this as an optimization hint to the compiler. Alternatively, the compiler could choose a SIMD vector size statically (depending on the target CPU's capabilities).

I understand that the match to "warp size" isn't perfect, since the SIMD vector size depends on the element type (64-bit vs 32-bit).
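As a small illustration of the analogy (a SIMD.jl sketch; the width of 8 is an arbitrary example): all lanes load and add in lockstep, and the final horizontal add is the lane-wide reduction.

```julia
using SIMD

# sum `x` W lanes at a time; assumes length(x) is a multiple of W
function simd_sum(x::Vector{Float32}, ::Val{W}) where {W}
    acc = Vec{W,Float32}(0f0)
    @inbounds for i in 1:W:length(x)
        acc += vload(Vec{W,Float32}, x, i)   # all W lanes load and add together
    end
    return sum(acc)                          # horizontal add across the lanes
end

x = rand(Float32, 64)
simd_sum(x, Val(8)) ≈ sum(x)
```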

eschnett avatar Dec 23 '23 20:12 eschnett