KernelAbstractions.jl

Lower-level kernel form?

Open anicusan opened this issue 8 months ago • 6 comments

This is mainly to start a conversation around the KA kernel language, as it is starting to accumulate more functionality / cruft; for example, if I want a high-performance kernel as I would write it in raw CUDA C++ (but backend- and type-agnostic and with all the Julia niceties), kernels start to look like:

@kernel unsafe_indices=true cpu=false inbounds=true function somekernel(arg1, @Const(arg2))
    ...
end

What I'd expect by default - a GPU kernel with performance comparable to CUDA - is not really what the language guides me towards, as I need to add @kernel unsafe_indices=true cpu=false inbounds=true to get close. Even then, with the recent @synchronize lane checks, we see big performance hits in previously well-performing code (e.g. from 540 ms to 1.54 s for a sum - see issue).
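For concreteness, a filled-in version of such a kernel might look something like this (a sketch of a hypothetical vadd!, computing the global index by hand from the group/local indices; the exact semantics of these options may of course differ between KA versions):

using KernelAbstractions

# Sketch: all the options spelled out, raw-CUDA-style manual indexing.
@kernel unsafe_indices=true cpu=false inbounds=true function vadd!(c, @Const(a), @Const(b))
    gi = @index(Group, Linear)
    li = @index(Local, Linear)
    i = (gi - 1) * prod(@groupsize()) + li
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
end

# Launched as usual (assuming the standard launch path still applies):
# vadd!(backend, 256)(c, a, b; ndrange=length(c))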

Perhaps this is the point where I should emphasise how much I appreciate KernelAbstractions and the titanic work put into it and the JuliaGPU ecosystem. I hope this post does not come across as sweeping criticism, but as a discussion of possible future improvements (of course, "improvements" here being simply my personal opinion, based on the work I do and how I use KA for HPC code).

Having followed KA development for a few years now, I understand the constraints that shaped the current KA interface - implicit boundschecks, separate CPU and GPU compilation pipelines, ndrange being, well, a range and not the blocksize and nblocks seen in CUDA, divergent synchronize, etc.

Would there be a possibility for, say, a @rawkernel, with more minimal functionality:

@rawkernel function somekernel(arg1, @const(arg2))
    # Closely mimic the typical GPU API (CUDA, OpenCL) only exposing the local and block indices
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end

# Example syntax to get the point across - I don't care much about the exact form now, just the functionality
block_size = 128
somekernel{backend, block_size}(arg1, arg2, nblocks=18)

Or a more JuliaGPU-like kernel syntax:

function somekernel(arg1, @const(arg2))
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end

result = @ka backend block_size=128 blocks=18 somekernel(arg1, arg2)

# Or create callable object
fkernel = @ka backend block_size=128 somekernel
fkernel(arg1, arg2, blocks=18)

This would map very closely to the GPU backends' own kernel languages (for comparison, see the CUDA.jl sketch after this list); I think it would have a few advantages:

  • Simpler to implement and maintain: e.g. no need to inject divergent synchronization checks.
  • Simpler to transpile to the right GPU backend (maybe even transpile Julia-to-Julia, then let the backend do the work?).
  • Simpler, more concise syntax.
  • More consistent usage with the corresponding JuliaGPU @cuda, @metal, etc. kernels.
  • And most importantly, performance as you'd expect from the equivalent CUDA C++ kernel.
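For reference, this is roughly the raw CUDA.jl baseline I have in mind (a sketch with a hypothetical vadd!; nothing KA-specific here):

using CUDA

# Plain CUDA.jl kernel: global index computed manually from block/thread ids.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024); b = CUDA.rand(Float32, 1024); c = similar(a)
@cuda threads=128 blocks=cld(length(c), 128) vadd!(c, a, b)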

What are your thoughts?

anicusan, Mar 05 '25 14:03

This is essentially the evolution I have in mind with #562: aligning KA relatively closely to OpenCL/SPIRV semantics.

My milestones for KA 0.10 are essentially:

  1. CPU to POCL transition
  2. Finish KernelIntrinsics
  3. Add a low-level "launch" interface.

But without touching the kernel language itself. I would then expect users like you to start using the lower-level interface directly.

KA v1.0 would then remove the deprecated functionality from the KA kernel language.

vchuravy, Mar 05 '25 15:03

What will the low-level interface look like?

Also, I’m concerned about maintaining the current performance levels as new features are added to KA. The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

anicusan, Mar 07 '25 18:03

What will the low-level interface look like?

Much more like programming OpenCL.

import KernelIntrinsics

function vadd(a, b, c)
    i = KernelIntrinsics.get_global_id()
    @inbounds c[i] = a[i] + b[i]
    return
end
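For contrast, (roughly) the same kernel written against the current high-level @kernel interface - just a sketch, with the workgroup size left for the backend to pick:

using KernelAbstractions

@kernel function vadd(a, b, c)
    i = @index(Global, Linear)
    @inbounds c[i] = a[i] + b[i]
end

a = rand(Float32, 1024); b = rand(Float32, 1024); c = similar(a)
backend = get_backend(c)   # CPU() for Arrays; the corresponding GPU backend for device arrays
vadd(backend)(a, b, c; ndrange=length(c))
KernelAbstractions.synchronize(backend)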

The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

Full agreement on this. I have been trying to be very cautious with changes like that, but in this case it was unavoidable to correctly map kernels onto existing GPU architectures.

vchuravy, Mar 12 '25 12:03

The 285% performance regression we experienced with the 0.9.34 semantics change was significant, and it would be great if we could avoid similar impacts in the future.

Haven't looked into why yet, but https://github.com/JuliaGPU/KernelAbstractions.jl/pull/564 completely hangs my machine with GaussianSplatting.jl. From the quick logs I was able to see before the hang, it looks like it happens in the render kernel. Haven't tried unsafe_indices yet.

pxl-th, Mar 13 '25 11:03

Since you don't use global indices, you should be able to add unsafe_indices.

vchuravy, Mar 13 '25 11:03

I also saw that the kernel now uses the malloc intrinsic (thus spawning a hostcall). Do you know why that is? And why doesn't every kernel that doesn't use unsafe_indices=true now do this?

I guess I'm not sure when/why unsafe_indices makes a difference now.

pxl-th, Mar 13 '25 11:03