KernelAbstractions.jl
Launch kernels and dependencies
KA currently uses very verbose and explicit dependency management:
    event = kernel(CPU())(...)
    event = kernel(CPU())(..., dependencies=(event,))
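For context, a fuller sketch of that API (the mul2! kernel here is illustrative, modeled on the KA quickstart):

    using KernelAbstractions

    @kernel function mul2!(A)
        I = @index(Global)
        A[I] = 2 * A[I]
    end

    A = ones(1024)
    kernel! = mul2!(CPU(), 16)                  # instantiate for the CPU backend
    ev = kernel!(A, ndrange=size(A))            # every launch returns an event
    ev = kernel!(A, ndrange=size(A), dependencies=(ev,))  # chain explicitly
    wait(ev)                                    # block until the chain completes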
This was added because, at the time, CUDA.jl used a single stream, which made it harder to expose concurrency.
Now @maleadt has added a really nice design around task-local streams, allowing users to use Julia tasks to express concurrency on the GPU as well.
So, in the interest of reducing the complexity of using KA and aligning it better with CUDA.jl, I would like to remove the dependency management and move to a stream-based model.
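To make that concrete, a hypothetical sketch of what a stream-based launch could look like, reusing the mul2! kernel from above (CUDABackend and the synchronize call are illustrative, roughly the shape KA later adopted; no events appear in user code):

    kernel! = mul2!(CUDABackend(), 256)
    kernel!(A, ndrange=size(A))    # enqueued on the current task's stream
    kernel!(A, ndrange=size(A))    # ordered after the previous launch
    KernelAbstractions.synchronize(CUDABackend())  # explicit host-side sync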
One open question is how to deal with the CPU (though this could mean we simply move to synchronous execution there, reducing latency as well).
An alternative I see is to explore a more implicit dependency model based on the arguments to the kernel; I think that would be similar to SYCL, or to what AMDGPU.jl currently does.
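Roughly, such an argument-based model could track, per array, the event of the last kernel launched against it, and have every launch implicitly depend on the events of its array arguments. A minimal sketch, with all types and helpers hypothetical:

    # Hypothetical wrapper: remembers the last event touching this array.
    mutable struct TrackedArray{A}
        data::A
        last_event::Any
    end

    function launch!(kernel!, args...; ndrange)
        # Implicit dependencies: collect pending events from array arguments.
        deps = Tuple(a.last_event for a in args
                     if a isa TrackedArray && a.last_event !== nothing)
        ev = kernel!((a isa TrackedArray ? a.data : a for a in args)...;
                     ndrange, dependencies=deps)
        # Record the new event so later launches wait on it automatically.
        for a in args
            a isa TrackedArray && (a.last_event = ev)
        end
        return ev
    end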
This would be the first step towards KA 1.0
CC interested parties: @glwagner @lcw @jpsamaroo @simonbyrne @kpamnany @omlins
I am all for reducing the complexity of KA.
Would the streams be exposed to the user? For example, would I need to create a copy stream and a compute stream to overlap memory transfers with compute? I am thinking about how to use MPI when we don't have GPUDirect.
You would need to use multiple Julia tasks: one for the compute and one for the copy/communication.
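A hypothetical sketch of that pattern under CUDA.jl's task-local streams (each spawned task gets its own stream; the buffers and the stand-in computation are illustrative):

    using CUDA

    function overlap_step!(A, B, host_buf)
        compute = Threads.@spawn begin
            A .= 2 .* A            # kernels run on this task's stream
            CUDA.synchronize()     # waits only on this task's stream
        end
        comm = Threads.@spawn begin
            copyto!(host_buf, B)   # device-to-host copy on another stream
            CUDA.synchronize()
            # ...hand host_buf to MPI here (no GPUDirect required)...
        end
        wait(compute); wait(comm)
    end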
I want to recommend using AMDGPU's behavior, but supporting it requires intrusive changes within the GPU array objects, and it likely adds some overhead during kernel launch (to search through argument structures and find the GPU arrays contained in them to synchronize on), although that overhead is probably not significant since the arguments must be concrete.
Anyway, if you do go with the stream model, AMDGPU can be trivially supported by putting barrier packets in the queue immediately following each kernel packet (this effectively turns queues into streams).
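In pseudocode (packet types and helpers all hypothetical), that trick looks roughly like:

    # Serialize an HSA-style queue by following every kernel dispatch
    # packet with a barrier packet on its completion signal.
    function enqueue_serialized!(queue, dispatch)
        sig = new_completion_signal()            # hypothetical helper
        dispatch.completion_signal = sig
        push_packet!(queue, dispatch)            # the kernel itself
        push_packet!(queue, BarrierPacket(sig))  # later packets wait on sig
    end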
> So, in the interest of reducing the complexity of using KA and aligning it better with CUDA.jl, I would like to remove the dependency management and move to a stream-based model.
I would appreciate that 100% :)
Fixed by #317.