
Auto-tuning workgroupsize when localmem consumption depends on it

Open tkf opened this issue 4 years ago • 7 comments

Does KernelAbstractions.jl support auto-setting the workgroupsize when the kernel's local memory size depends on the groupsize? For example, CUDA.launch_configuration takes a shmem callback that maps a number of threads to the shared memory used. This is used for implementing mapreduce in CUDA.jl. Since the shmem argument for CUDA.launch_configuration is not used in Kernel{CUDADevice}, I guess it's not implemented yet? Is it related to #19?
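To make the question concrete, here is a minimal Python sketch of the kind of search a shmem-aware launch configuration performs: pick the largest workgroup size whose per-group local memory still fits the device limit. The device numbers (1024 max threads, 48 KiB shared memory) are illustrative assumptions, not values queried from any real API.

```python
# Hypothetical sketch of a shmem-aware workgroupsize search,
# mimicking the role of CUDA.launch_configuration's `shmem` callback.
# Device limits below are illustrative assumptions.

def pick_workgroupsize(shmem_per_thread_bytes, max_threads=1024,
                       max_shmem_bytes=48 * 1024):
    """Return the largest power-of-two thread count whose shared-memory
    footprint (threads * bytes-per-thread) fits in `max_shmem_bytes`."""
    size = max_threads
    while size > 1 and size * shmem_per_thread_bytes > max_shmem_bytes:
        size //= 2
    return size

# e.g. a mapreduce-style kernel needing one Float64 (8 bytes) per thread:
print(pick_workgroupsize(8))    # fits at the full 1024 threads
```

The key point the issue raises is that the callback (here, the per-thread byte count) must be known before launch, which is exactly what the shmem callback provides.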

tkf avatar Feb 21 '21 22:02 tkf

This is #11; KA doesn't support dynamic shared memory.

vchuravy avatar Feb 21 '21 22:02 vchuravy

Does #11 have auto-tuning? I skimmed the code but couldn't find any. Or is it planned but not implemented?

tkf avatar Feb 21 '21 23:02 tkf

No, #11 was started before we added auto-tuning, and it stalled since no one had a clear need for it.

vchuravy avatar Feb 21 '21 23:02 vchuravy

Oh, that sounds like I need to give it a shot if I want it :joy:

I'm still not clear on how to implement auto-tuning with #11, though. If I write @dynamic_localmem T (workgroupsize) -> expression_with(T, workgroupsize), I also need a way to compute T from the arguments to the kernel, which can be arbitrarily complex. Since Cassette operates on untyped IR, isn't it impossible to get T given the kernel argument types? Doing this at the macro level is even more hopeless. Also, what about a @dynamic_localmem behind an inlinable function call?

If these concerns are legitimate, maybe we still need the explicit shmem callback-like approach?
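A rough Python sketch of what that explicit-callback approach might look like: the caller, who statically knows the element type, supplies a `shmem(threads) -> bytes` closure, so the launcher never has to recover T from untyped IR. The function name, step size, and limits here are all hypothetical, chosen only to illustrate the shape of the API.

```python
# Hypothetical sketch of an explicit shmem-callback launch API.
# The caller passes `shmem(threads) -> bytes`; the launcher just
# scans candidates and never needs to infer the element type.

def launch_configuration(max_threads, max_shmem_bytes, shmem):
    """Return the largest candidate thread count (stepping down by 32,
    a warp-sized step chosen for illustration) whose dynamic local
    memory, as reported by the callback, fits the budget."""
    for threads in range(max_threads, 0, -32):
        if shmem(threads) <= max_shmem_bytes:
            return threads
    raise ValueError("no feasible workgroup size")

# The caller knows T == Float32, so the callback closes over
# sizeof(T) == 4 bytes per thread:
print(launch_configuration(1024, 48 * 1024, lambda t: 4 * t))
```

This mirrors how CUDA.jl's mapreduce uses the shmem callback of CUDA.launch_configuration: the type-dependent part lives in the closure, outside the launcher.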

tkf avatar Feb 22 '21 00:02 tkf

I'm in particular interested in the use case combined with pre-launch workgroupsize auto-tuning #216.

tkf avatar Feb 22 '21 00:02 tkf