@inbounds not propagating correctly
`@inbounds` applied to the kernel function definition has no effect. Additionally, `@inbounds` does not propagate through function calls made inside a kernel, for example through a call to `zip()`.

The following benchmarks from https://github.com/torrance/AMDGPU-MWE/blob/main/inbounds.jl demonstrate the performance penalty; a sketch of the three kernel variants follows the legend below. Note that the third benchmark is likely doubly penalised, since the call to `zip()` isn't inlined.
- function `@inbounds` => `@inbounds` annotated at the function definition
- internal `@inbounds` => `@inbounds` annotated on the lines with indexing operations
- using `zip()` => using `zip()` to iterate over and index into the arrays
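For reference, here is a minimal sketch of what the three variants look like. This is not the exact code from the MWE linked above: the kernel bodies, names (`vadd_function!`, `vadd_internal!`, `vadd_zip!`), array arguments, and element types are placeholders, and the indexing uses AMDGPU.jl's workitem intrinsics.

```julia
using AMDGPU

# Variant 1: "function @inbounds" — wraps the whole definition,
# but the benchmarks show it has no effect inside the GPU kernel.
@inbounds function vadd_function!(c, a, b)
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    if i <= length(c)
        c[i] = a[i] + b[i]    # bounds checks are still emitted
    end
    return nothing
end

# Variant 2: "internal @inbounds" — annotating the indexing
# expressions directly does elide the bounds checks.
function vadd_internal!(c, a, b)
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

# Variant 3: "using zip()" — even under @inbounds, the checks inside
# zip's iterate() survive, because @inbounds does not reach through
# the (non-inlined) call. Every workitem walks the full arrays here;
# purely illustrative.
function vadd_zip!(c, a, b)
    @inbounds for (i, (x, y)) in enumerate(zip(a, b))
        c[i] = x + y
    end
    return nothing
end
```

A possible benchmarking harness is sketched after the results below; the actual one is in the linked MWE.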
**Function @inbounds**

```
BenchmarkTools.Trial: 18 samples with 1 evaluation.
Range (min … max): 283.219 ms … 287.235 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median):     283.964 ms              ┊ GC (median):    0.00%
Time (mean ± σ):   284.278 ms ± 874.447 μs ┊ GC (mean ± σ):  0.10% ± 0.29%

          ▁█
▄▁▁▁▁▁▁▁▄▄██▁▁▁▁▄▄▁▁▄▁▁▁▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
283 ms           Histogram: frequency by time          287 ms <

Memory estimate: 6.21 MiB, allocs estimate: 406760.
```

**Internal @inbounds**

```
BenchmarkTools.Trial: 36 samples with 1 evaluation.
Range (min … max): 141.340 ms … 141.616 ms ┊ GC (min … max): 1.78% … 0.00%
Time (median):     141.471 ms              ┊ GC (median):    0.00%
Time (mean ± σ):   141.469 ms ± 69.181 μs  ┊ GC (mean ± σ):  0.10% ± 0.42%

       ▃  ▃▃   ▃▃    ▃▃ █     ▃
▇▁▁▁▁▁▁▇▇█▁▇▇▁▁▁██▇█▁▁▇▇▁▁▁▁██▇█▇▁▁▁▁▇▁▁█▇▁▇▇▁▁▁▇▁▇▁▇▁▁▁▁▇▁▁▇ ▁
141 ms           Histogram: frequency by time          142 ms <

Memory estimate: 3.06 MiB, allocs estimate: 200490.
```

**Using zip()**

```
BenchmarkTools.Trial: 16 samples with 1 evaluation.
Range (min … max): 318.848 ms … 319.049 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median):     318.942 ms              ┊ GC (median):    0.00%
Time (mean ± σ):   318.950 ms ± 61.016 μs  ┊ GC (mean ± σ):  0.10% ± 0.28%

▁       ▁    ▁   █▁    ▁ ▁        ▁  ▁       ▁█        ▁     ▁▁
█▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁██▁▁▁█▁█▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁█▁▁▁▁▁██ ▁
319 ms           Histogram: frequency by time          319 ms <
```
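For completeness, a harness along these lines should reproduce the comparison. The array size and launch configuration here are hypothetical, and `@roc` keyword and synchronisation semantics have changed across AMDGPU.jl versions, so adjust for your release:

```julia
using AMDGPU, BenchmarkTools

n = 2^20
a = AMDGPU.rand(Float64, n)
b = AMDGPU.rand(Float64, n)
c = AMDGPU.zeros(Float64, n)

groupsize = 256
gridsize  = cld(n, groupsize)   # assumes gridsize counts workgroups

# Benchmark one variant; swap in vadd_function! / vadd_zip! to compare.
@benchmark AMDGPU.@sync @roc groupsize=$groupsize gridsize=$gridsize vadd_internal!($c, $a, $b)
```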
Does this work on CUDA? If so, I can take a look at how they do it and try to mirror their implementation.
> Does this work on CUDA? If so, I can take a look at how they do it and try to mirror their implementation.
@jpsamaroo In fact you're right: my benchmarking shows it also fails to work with CUDA.jl. The run times order as (fastest first):

`(inline @inbounds) < (function @inbounds) == (no @inbounds) < (zip with @inbounds)`
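For context, this matches Julia's documented bounds-checking rules on the CPU: `@inbounds` elides `@boundscheck` blocks in the code it encloses and in callees that get inlined into it, and `Base.@propagate_inbounds` is needed to push that state any deeper. A small CPU-side sketch of the mechanism (the `MyVec` type and `first_elem`/`demo` functions are made up for illustration, following the idiom from the Julia manual):

```julia
# Hypothetical container demonstrating the elision rules.
struct MyVec
    data::Vector{Float64}
end

Base.length(v::MyVec) = length(v.data)

# The @boundscheck block is elided when this getindex is inlined
# into a caller's @inbounds block.
Base.@inline function Base.getindex(v::MyVec, i::Int)
    @boundscheck checkbounds(v.data, i)
    return @inbounds v.data[i]
end

# @propagate_inbounds forwards the caller's inbounds state into
# this function's own indexing.
Base.@propagate_inbounds first_elem(v::MyVec) = v[1]

function demo(v::MyVec)
    s = 0.0
    @inbounds begin
        for i in 1:length(v)
            s += v[i]       # elided: getindex inlined into @inbounds
        end
        s += first_elem(v)  # elided: @propagate_inbounds forwards state
    end
    return s
end

demo(MyVec(rand(4)))
```

If `zip`'s `iterate` is neither inlined into the kernel nor on a `@propagate_inbounds` path when compiled for the GPU, its internal checks survive, which would explain the `zip()` numbers above.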
Should this be an issue raised with GPUCompiler? Or...?
> Should this be an issue raised with GPUCompiler? Or...?
Yeah, that seems like the play.