
Significant perf drop when using dynamic ranges in GPU kernel

Open luraess opened this issue 1 year ago • 9 comments

Running the CUDA benchmarks from the HPCBenchmarks.jl tests shows a significant performance drop when using KA with a dynamic range definition. The tests below were performed on GH200 using a local CUDA 12.4 install and Julia 1.10.2.

  • Using dynamic ranges ndrange as implemented in the benchmark https://github.com/PTsolvers/HPCBenchmarks.jl/blob/a5985aaaf931efb0caf194d669e3bfcb90c5c08e/CUDA/diffusion_3d.jl#L39:
diffusion_kernel_ka!(CUDABackend(), 256)($A_new, $A, $h; ndrange=($n, $n, $n))

returns a nearly 50% perf drop compared to plain CUDA.jl and the reference CUDA C:

[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.865 μs)
  "reference" => Trial(92.161 μs)
  "julia-ka" => Trial(173.473 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(771.301 μs)
  "reference" => Trial(672.581 μs)
  "julia-ka" => Trial(1.299 ms)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.251 ms)
  "reference" => Trial(5.833 ms)
  "julia-ka" => Trial(10.285 ms)
  • When modifying it to use a static range definition:
diffusion_kernel_ka!(CUDABackend(), 256, ($n, $n, $n))($A_new, $A, $h)

returns

[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.993 μs)
  "reference" => Trial(92.416 μs)
  "julia-ka" => Trial(103.649 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(770.790 μs)
  "reference" => Trial(672.037 μs)
  "julia-ka" => Trial(769.701 μs)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.250 ms)
  "reference" => Trial(5.873 ms)
  "julia-ka" => Trial(6.121 ms)
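
For reference, the two launch forms differ only in where the iteration space is supplied (sketch below; the kernel body is an illustrative stand-in, not the actual stencil from the linked benchmark):

```julia
using KernelAbstractions, CUDA

# Stand-in kernel body (illustrative only, not the benchmark's stencil).
@kernel function diffusion_kernel_ka!(A_new, A, h)
    ix, iy, iz = @index(Global, NTuple)
    @inbounds A_new[ix, iy, iz] = A[ix, iy, iz] * h
end

# Dynamic ndrange: the iteration space is passed at launch time, so the
# linear-to-Cartesian index math cannot be constant-folded.
diffusion_kernel_ka!(CUDABackend(), 256)(A_new, A, h; ndrange=(n, n, n))

# Static ndrange: the iteration space is baked into the kernel instance,
# allowing LLVM to fold the index divisions away.
diffusion_kernel_ka!(CUDABackend(), 256, (n, n, n))(A_new, A, h)
```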

luraess avatar Apr 03 '24 15:04 luraess

Yeah, this is due to KA allowing arbitrary dimensions instead of limiting the user to 3.

You end up in https://github.com/JuliaGPU/CUDA.jl/blob/7f725c0a117c2ba947015f48833630605501fb3a/src/CUDAKernels.jl#L178 and thereafter in https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c5fe83c899b3fd29308564467c3a3722179bfe9d/src/nditeration.jl#L73

So if we don't know the ndrange, the code here won't be optimized away and we execute quite a few more integer operations, which is particularly costly for small kernels.
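
Concretely, the extra work is the per-thread linear-to-Cartesian conversion. A rough sketch of what that conversion does (a hypothetical simplification, not the actual recursive `_ind2sub` implementation):

```julia
# Sketch of the linear -> Cartesian decode KA performs per thread.
# With runtime `dims`, each dimension costs a genuine integer division;
# with compile-time power-of-two dims, LLVM can turn these into shifts.
function ind2sub_sketch(i::Int, dims::NTuple{N,Int}) where {N}
    sub = Int[]
    for d in dims
        push!(sub, i % d)  # remainder within this dimension
        i = i ÷ d          # one integer division per dimension
    end
    return Tuple(sub)
end

ind2sub_sketch(5, (4, 4, 4))  # 0-based index 5 in a 4x4x4 grid -> (1, 1, 0)
```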

One avenue I have been meaning to try, but never got around to, is to ensure that most of the index calculations occur using Int32.

vchuravy avatar Apr 03 '24 16:04 vchuravy

Can you use CUDA.@device_code dir="out" for both kernels? In particular the optimized .ll would be of interest.

vchuravy avatar Apr 03 '24 16:04 vchuravy

Here are the outputs from the device_code for dynamic (dyn) and static (stat) expressions.

out_dyn.zip out_stat.zip

luraess avatar Apr 04 '24 06:04 luraess

There is a performance pitfall that I didn't expect...

https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c5fe83c899b3fd29308564467c3a3722179bfe9d/src/nditeration.jl#L83

; │┌ @ /srv/scratch/lraess/julia_depot/packages/KernelAbstractions/zPAn3/src/nditeration.jl:84 within `expand`
; ││┌ @ abstractarray.jl:1291 within `getindex`
; │││┌ @ abstractarray.jl:1336 within `_getindex`
; ││││┌ @ abstractarray.jl:1343 within `_to_subscript_indices`
; │││││┌ @ abstractarray.jl:1365 within `_unsafe_ind2sub`
; ││││││┌ @ abstractarray.jl:2962 within `_ind2sub` @ abstractarray.jl:3000
; │││││││┌ @ int.jl:86 within `-`
          %57 = zext i32 %56 to i64, !dbg !280
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %58 = udiv i64 %57, %.fca.1.0.0.0.0.extract, !dbg !145
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3013
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %59 = icmp sgt i64 %.fca.1.0.0.1.0.extract, 0, !dbg !281
            br i1 %59, label %pass11, label %fail10, !dbg !281

fail10:                                           ; preds = %pass
            call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception117 to i64)), !dbg !281
            call fastcc void @gpu_signal_exception({ i64, i32 } %state), !dbg !281
            call void @llvm.trap(), !dbg !281
            call void @llvm.trap(), !dbg !281
            call void asm sideeffect "exit;", ""(), !dbg !281
            unreachable, !dbg !281

pass11:                    

We have a call to div there which does a check for 0 and otherwise will throw an error. div on its own is bad enough, and I was trying to avoid those in the happy path...

vchuravy avatar Apr 04 '24 18:04 vchuravy

x-ref: https://github.com/JuliaGPU/GPUArrays.jl/pull/520

vchuravy avatar Apr 04 '24 18:04 vchuravy

In contrast, with a constant ndrange:

; │┌ @ /srv/scratch/lraess/julia_depot/packages/KernelAbstractions/zPAn3/src/nditeration.jl:84 within `expand`
; ││┌ @ abstractarray.jl:1291 within `getindex`
; │││┌ @ abstractarray.jl:1336 within `_getindex`
; ││││┌ @ abstractarray.jl:1343 within `_to_subscript_indices`
; │││││┌ @ abstractarray.jl:1365 within `_unsafe_ind2sub`
; ││││││┌ @ abstractarray.jl:2962 within `_ind2sub` @ abstractarray.jl:3000
; │││││││┌ @ int.jl:86 within `-`
          %5 = zext i32 %4 to i64, !dbg !71
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %6 = lshr i64 %5, 2, !dbg !89
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3013
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %7 = lshr i64 %5, 12, !dbg !95
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3014
; ││││││││┌ @ int.jl:88 within `*`
           %.neg = mul nsw i64 %7, -1024, !dbg !99
; ││││││││└
; ││││││││┌ @ int.jl:86 within `-`
           %8 = add nsw i64 %.neg, %6, !dbg !102
; │││││││└└
; │││││││┌ @ int.jl:86 within `-`
          %9 = zext i32 %3 to i64, !dbg !71
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %10 = lshr i64 %9, 8, !dbg !89
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse`
; ││││││││┌ @ int.jl:86 within `-`
           %11 = and i64 %9, 255, !dbg !103
; ││││││││└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3014 @ abstractarray.jl:3008
; ││││││││┌ @ abstractarray.jl:3018 within `_lookup`
; │││││││││┌ @ int.jl:87 within `+`
            %12 = add nuw nsw i64 %10, 1, !dbg !104
; ││└└└└└└└└

The division is turned into an lshr.
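
That is the standard strength reduction for unsigned division by a power of two, which LLVM can only apply when the divisor is a compile-time constant. The identity being exploited:

```julia
# Division by a constant power of two is a right shift for unsigned
# integers, which is why the constant-ndrange IR contains no udiv:
x = UInt64(12345)
@assert x ÷ 4    == x >> 2   # matches `lshr i64 %5, 2`
@assert x ÷ 4096 == x >> 12  # matches `lshr i64 %5, 12`
```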

vchuravy avatar Apr 04 '24 19:04 vchuravy

Should one do more globally what was done for Metal there?

luraess avatar Apr 04 '24 19:04 luraess

I am not sure right now.

  1. We could special-case 1D/2D/3D NDRanges
  2. Maybe https://github.com/maleadt/StaticCartesian.jl would help, but in this case we don't have a static set of Cartesian indices
  3. The core issue is that we are going from a linear index to a Cartesian one; can we get around that without breaking KA tiling?
  4. (Low-priority) do indexing math in 32-bit
  5. Profiling to see if the issue is the udiv or the exception branch. (The exception branch we could remove.)
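
For option 1, a 1D/2D/3D special case could map the hardware indices directly, avoiding the linear-to-Cartesian decode entirely (a hypothetical sketch in CUDA.jl style, not KA's current code path):

```julia
# Hypothetical 3D fast path: read the hardware grid position directly,
# so no integer division is needed to recover (ix, iy, iz).
ix = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
iy = (blockIdx().y - Int32(1)) * blockDim().y + threadIdx().y
iz = (blockIdx().z - Int32(1)) * blockDim().z + threadIdx().z
```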

vchuravy avatar Apr 04 '24 19:04 vchuravy

Just a pointer to the relevant Metal implementation of using hardware indices when available: https://github.com/JuliaGPU/Metal.jl/blob/28576b3f4601ed0b32ccc74485cddf9a6f56249c/src/broadcast.jl#L82-L147

maleadt avatar Aug 23 '24 12:08 maleadt