dask-cuda
Support for CUDA streams
When writing CUDA applications, an important aspect for keeping GPUs busy is the use of streams to enqueue operations asynchronously from the host.
Libraries such as Numba and CuPy offer support for CUDA streams, but today we don't know to what extent they're functional with Dask.
I believe CUDA streams will be beneficial for getting higher performance: particularly in the case of many small operations, streams may help Dask keep dispatching work asynchronously while GPUs do the work.
We should check what the correct way of using them with Dask is, and whether/how they provide performance improvements.
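As a rough illustration (a minimal sketch using CuPy, not code from this issue), this is what enqueueing work on a non-default stream looks like; the host regains control immediately and only synchronizes when it actually needs the result:

```python
# Minimal sketch: enqueue CuPy work on a non-default stream so the host stays
# free to keep scheduling (e.g. Dask dispatching more tasks) while the GPU runs.
import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)  # a stream independent of the default stream

with stream:
    # Kernels launched inside this context are enqueued on `stream` and return
    # control to the host immediately.
    x = cp.random.random((4_000, 4_000))
    y = x @ x.T

# ... the host can do other work here ...

stream.synchronize()   # wait for the enqueued work to finish
print(float(y[0, 0]))  # safe to consume the result on the host now
```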
cc @mrocklin @jakirkham @jrhemstad
There is no correct way today. Every Python GPU library has its own Python wrapping of the CUDA streams API, so there is no consistent way to do this at the moment. Also cc'ing @kkraus14
FWIW, I recently stumbled across this bug ( https://github.com/cupy/cupy/issues/2159 ) on the CuPy issue tracker. Not sure if it is directly related, but figured it was worth being aware of.
Possibly related: numba/numba#4797
One potentially interesting idea to consider would be to create streams and store them in thread-local storage. This way we could grab the stream for the current thread and apply it to operations within that thread relatively easily. Ideally this would give us the benefits of compiling with --default-stream per-thread without necessarily having to do that. The question then is where this functionality should live: dask-cuda seems like a reasonable place, but we could also imagine Numba, CuPy, or others being reasonable candidates. Thoughts?
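As a hypothetical sketch of that idea (none of these names exist in dask-cuda today; CuPy is assumed for the stream object):

```python
# Hypothetical sketch: one CUDA stream per worker thread, stored in
# thread-local storage, so work submitted from different threads is
# enqueued on different streams.
import threading
import cupy as cp

_local = threading.local()

def get_thread_stream():
    """Return a CUDA stream unique to the calling thread, creating it on first use."""
    stream = getattr(_local, "stream", None)
    if stream is None:
        stream = cp.cuda.Stream(non_blocking=True)
        _local.stream = stream
    return stream

def run_on_thread_stream(fn, *args, **kwargs):
    """Run `fn` with this thread's stream set as CuPy's current stream."""
    with get_thread_stream():
        return fn(*args, **kwargs)
```

Worker threads could then wrap task execution in something like run_on_thread_stream, which is roughly what compiling with --default-stream per-thread would give for free.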
cc @kkraus14
I've never really used --default-stream per-thread. What's the behavior if you happen to access the same data from multiple threads? Do you need explicit synchronization? If so, how would that be handled automatically without requiring user intervention?
Just want to add that CuPy already does exactly what John said (thread-local streams), see https://github.com/cupy/cupy/blob/096957541c36d4f2a3f2a453e88456fffde1c8bf/cupy/cuda/stream.pyx#L8-L46.
For @pentschev's question: I think users who write a multi-threaded GPU program are responsible for taking care of synchronization, as they would in CUDA C/C++ (see the sketch after this comment); not sure if there's an easy way out.
By the way, a discussion on per-thread default streams was recently brought up in Numba: numba/numba#5137.
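To make that concrete, here is a minimal sketch (CuPy streams and events, not code from this thread) of the kind of explicit synchronization a user would be responsible for when one thread produces data and another consumes it:

```python
# Minimal sketch of explicit cross-stream synchronization with CuPy.
import cupy as cp

producer_stream = cp.cuda.Stream(non_blocking=True)
consumer_stream = cp.cuda.Stream(non_blocking=True)
ready = cp.cuda.Event()

# What the producer thread would do: enqueue its work, then record an event
# marking the point at which `data` is ready.
with producer_stream:
    data = cp.arange(1_000_000) ** 2
    ready.record(producer_stream)

# What the consumer thread would do: make its own stream wait on that event
# before enqueueing anything that reads `data`.
consumer_stream.wait_event(ready)
with consumer_stream:
    total = data.sum()
```

Without the event (or an explicit stream/device synchronize), the consumer's kernels could run before the producer's have finished.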
It seems like --default-stream per-thread could then be helpful if we assign multiple threads per worker (currently we only use one in dask-cuda). My original idea was to have the same thread working on multiple streams, but I think that may be too complex a use case for Dask. However, multiple threads per worker on the same GPU may limit the problem size due to a potentially larger memory footprint, so we need to really test this out and see whether it is something we could work with in Dask.
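For reference, a hedged sketch of that experiment (threads_per_worker is assumed to be forwarded by LocalCUDACluster; today the default is one thread per GPU worker):

```python
# Sketch: start a local dask-cuda cluster with more than one thread per GPU
# worker, so per-thread default streams (or thread-local streams) could
# overlap work on the same device.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster(threads_per_worker=2)  # 2 threads per GPU worker
    client = Client(cluster)
    print(client)
```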
Thanks for the info, Leo! So maybe this is less a question of how dask-cuda specifically should handle this, and more a question of how we should work together across libraries to achieve a shared stream solution.
I'd also note that --default-stream per-thread is currently incompatible with RMM pool mode; that is planned to be addressed in the future, but likely won't be for a while.
Updating this issue a bit. It is possible to enable --default-stream per-thread with RMM's default CNMeM pool ( https://github.com/rapidsai/rmm/pull/354 ), but it comes with some tradeoffs. There is also a new RMM pool for which --default-stream per-thread support is being worked on in PR ( https://github.com/rapidsai/rmm/pull/425 ).
I'm asking about PTDS support with CuPy in issue ( https://github.com/cupy/cupy/issues/3755 ).
We would also need to make a change to Distributed to support PTDS; I have submitted a draft PR ( https://github.com/dask/distributed/pull/4034 ) with those changes.
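For context, a hedged sketch of the RMM/CuPy setup that PTDS support has to cooperate with (rmm_cupy_allocator has moved between modules across RMM versions, so the exact import path may differ):

```python
# Sketch: create an RMM memory pool and route CuPy allocations through RMM.
# PTDS support in RMM/CuPy needs to work on top of a configuration like this.
import rmm
import cupy as cp

rmm.reinitialize(pool_allocator=True, initial_pool_size=2 ** 30)  # 1 GiB pool
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)  # CuPy allocations now come from the RMM pool
```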
Following up on the earlier note that --default-stream per-thread is incompatible with RMM pool mode: PR ( https://github.com/rapidsai/rmm/pull/466 ) will switch us to the new RMM pool, which does support PTDS.
I began with some testing and changes to enable PTDS support in https://github.com/rapidsai/rmm/pull/633 and https://github.com/cupy/cupy/pull/4322 .
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
There is active work here, though it's not really tracked in this issue, as a few libraries are involved.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This is ongoing work. See issue ( https://github.com/rapidsai/dask-cuda/issues/517 ) for a more up-to-date status.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.