
CPU `__thread_run` could loop over CartesianIndices?

Open rafaqz opened this issue 1 year ago • 7 comments

I noticed in Stencils.jl that when I'm using a fast stencil (e.g. a 3x3 window summing over a Matrix{Bool}), the indexing in __thread_run takes longer than actually reading and summing the stencil!

It seems to be because the conversion from linear back to Cartesian indices is pretty slow. I'm getting 4 ns for N=2, 7 ns for N=3, and 11 ns for N=4 on my laptop, so there is also a penalty for adding dimensions.
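The conversion in question is essentially `CartesianIndices(dims)[i]`, which performs one integer div/rem per dimension, so the cost grows with N. A quick illustration (timings are machine-dependent, so omitted; the dims here are just examples):

```julia
# Linear -> Cartesian conversion does one integer div/rem per dimension.
ci2 = CartesianIndices((64, 64))
ci4 = CartesianIndices((16, 16, 16, 16))
@assert ci2[65] == CartesianIndex(1, 2)           # column-major: 65 -> (1, 2)
@assert ci4[17] == CartesianIndex(1, 2, 1, 1)     # 17 -> (1, 2, 1, 1)
# BenchmarkTools can show the per-index cost, e.g.:
# using BenchmarkTools; @btime $ci4[1234]
```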

Could we switch the loop to iterating over CartesianIndices directly?

I guess it will make dividing up the array a little messier, and it might be slower for really large workloads where an even split of tasks matters more than 7 ns per operation. It could have a keyword to choose between behaviours.
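A minimal sketch of the idea (my own illustration, not KA's actual __thread_run code): hand each task a block of CartesianIndices instead of a linear range, so the inner loop never does the integer-division conversion.

```julia
# Sketch only: split a 2D ndrange into per-task Cartesian blocks.
A = zeros(8, 8)
B = reshape(1.0:64.0, 8, 8)
blocks = (CartesianIndices((1:8, 1:4)), CartesianIndices((1:8, 5:8)))
for blk in blocks      # in __thread_run each block would be its own task
    for I in blk       # direct Cartesian iteration, no integer div
        A[I] = B[I]
    end
end
@assert A == B
```

The messy part is choosing along which dimensions to split so the blocks stay balanced, which is what the keyword would trade off against.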

rafaqz avatar Dec 29 '23 15:12 rafaqz

Might be interesting, I haven't looked into the execution there too closely.

Do you have a benchmark?

vchuravy avatar Dec 29 '23 16:12 vchuravy

Just some Stencils.jl profiles on another machine.

But I can write up a PR and we can benchmark it

rafaqz avatar Dec 29 '23 16:12 rafaqz

If you can contribute it here: https://github.com/JuliaGPU/KernelAbstractions.jl/tree/main/benchmark that would be nice!

vchuravy avatar Dec 29 '23 16:12 vchuravy

Seems it's because my workgroup size was 4 - I guess you're expecting much larger workgroups on CPU?

I never totally got my head around what workgroup size means on the CPU, when the work is divvied up before the workgroup anyway. I was guessing it didn't make much difference what the workgroup size was, but this is a case where it does (very small workloads).

rafaqz avatar Dec 29 '23 16:12 rafaqz

I guess it's kind of academic if you can get around it with large workgroups. But comparing workgroup sizes 1 and 64:

using KernelAbstractions
# The kernel definition isn't shown in the thread; a standard KA copy kernel:
@kernel function copy_kernel!(A, @Const(B))
    I = @index(Global)
    A[I] = B[I]
end
kernel1! = copy_kernel!(CPU(), 1)
kernel64! = copy_kernel!(CPU(), 64)
A = rand(16, 16, 16, 16)
B = rand(16, 16, 16, 16)

Benchmarks:

julia> @btime kernel1!(A, B; ndrange=size(A))
  1.799 ms (99 allocations: 6.80 KiB)

julia> @btime kernel64!(A, B; ndrange=size(A))
  439.169 μs (99 allocations: 6.80 KiB)

And you can see that the difference in the profile for 1 vs 64 (left vs right) is all integer div from the linear-to-Cartesian conversion.

[screenshot: 2023-12-29-180754_1920x1080 - flame graphs for workgroup sizes 1 and 64]

using ProfileView
@profview for i in 1:100 kernel1!(A, B; ndrange=size(A)) end
@profview for i in 1:100 kernel64!(A, B; ndrange=size(A)) end

rafaqz avatar Dec 29 '23 17:12 rafaqz

Yeah, for the CPU I often use a workgroupsize of 1024

vchuravy avatar Jan 14 '24 15:01 vchuravy

I've been wondering if the CPU workgroup size should mean "how much we unroll"

rafaqz avatar Jan 15 '24 20:01 rafaqz