KernelAbstractions.jl
CPU `__thread_run` could loop over CartesianIndices?
I noticed in Stencils.jl that when I'm using a fast stencil (e.g. a 3x3 window sum over a `Matrix{Bool}`), the indexing in `__thread_run` takes longer than actually reading and summing the stencil!
It seems to be because the conversion from linear back to Cartesian indices is pretty slow: I'm getting about 4 ns for N=2, 7 ns for N=3 and 11 ns for N=4 on my laptop, so there's also a penalty for every added dimension.
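Roughly the kind of micro-benchmark I mean (timings are from my laptop, and the sizes/indices here are arbitrary examples, not KernelAbstractions internals):

```julia
# Illustrative timing of the linear -> Cartesian conversion alone
# (sizes and indices are arbitrary; numbers will vary by machine).
using BenchmarkTools

ci2 = CartesianIndices((64, 64))
ci3 = CartesianIndices((64, 64, 64))
ci4 = CartesianIndices((64, 64, 64, 64))

@btime $ci2[1234]      # ~4 ns  (N = 2)
@btime $ci3[123456]    # ~7 ns  (N = 3)
@btime $ci4[1234567]   # ~11 ns (N = 4)
```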
Could we switch the loop to iterating over CartesianIndices directly?
I guess it will make dividing up the array a little messier, and it might be slower for really large workloads where an even split of tasks matters more than 7 ns per operation. It could take a keyword argument to choose between the two behaviours.
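Very roughly, the idea is something like this (just the shape of the change, not the actual `__thread_run` code):

```julia
# Hypothetical sketch of the inner loop, not the real implementation.
inds = CartesianIndices((16, 16, 16, 16))

# Current style: loop over a linear range and convert each index back,
# paying an integer div/mod per dimension on every iteration.
for i in 1:length(inds)
    I = inds[i]
    # ... run the kernel body at I
end

# Proposed style: iterate the CartesianIndices block directly,
# so the index is just incremented dimension by dimension.
for I in inds
    # ... run the kernel body at I
end
```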
Might be interesting; I haven't looked into the execution there too closely.
Do you have a benchmark?
Just some Stencils.jl profiles on another machine.
But I can write up a PR and we can benchmark it.
If you can contribute it here: https://github.com/JuliaGPU/KernelAbstractions.jl/tree/main/benchmark that would be nice!
Seems it's because my workgroup size was 4 - I guess you're expecting much larger workgroups on CPU?
I never totally got my head around what workgroup size means on the CPU when the work is divvied up across threads before the workgroup anyway. I was guessing the workgroup size didn't make much difference, but this is a case where it does (very small workloads).
I guess it's kind of academic if you can get around it with large workgroups. But comparing workgroup sizes of 1 and 64:
```julia
using KernelAbstractions, BenchmarkTools

kernel1!  = copy_kernel!(CPU(), 1)
kernel64! = copy_kernel!(CPU(), 64)

A = rand(16, 16, 16, 16)
B = rand(16, 16, 16, 16)
```
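(`copy_kernel!` here is just a plain elementwise copy; my exact definition isn't important, but it's roughly:)

```julia
# Roughly what copy_kernel! looks like (illustrative, not my exact code)
@kernel function copy_kernel!(A, @Const(B))
    I = @index(Global, Cartesian)
    @inbounds A[I] = B[I]
end
```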
Benchmarks:

```julia
julia> @btime kernel1!(A, B; ndrange=size(A))
  1.799 ms (99 allocations: 6.80 KiB)

julia> @btime kernel64!(A, B; ndrange=size(A))
  439.169 μs (99 allocations: 6.80 KiB)
```
And you can see the difference in the profiles for 1 vs 64 (left vs right): it's all integer div from the linear-to-Cartesian conversion.
```julia
using ProfileView

@profview for i in 1:100 kernel1!(A, B; ndrange=size(A)) end
@profview for i in 1:100 kernel64!(A, B; ndrange=size(A)) end
```
Yeah, for the CPU I often use a workgroupsize of 1024.
I've been wondering if the CPU workgroup size should mean "how much we unroll"
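Something along these lines, just to illustrate the idea (purely a sketch under that assumption, not how the CPU backend currently works):

```julia
# Hypothetical sketch: treat the CPU "workgroup size" as an unroll/block factor.
# `f` stands in for the compiled kernel body; none of this is KernelAbstractions API.
function run_unrolled!(f, inds::CartesianIndices, groupsize::Int)
    n = length(inds)
    i = 1
    while i + groupsize - 1 <= n
        # fixed-length inner loop that the compiler can unroll / vectorize
        for j in 0:groupsize-1
            f(inds[i + j])
        end
        i += groupsize
    end
    for k in i:n    # remainder that doesn't fill a whole group
        f(inds[k])
    end
end
```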