dask-cuda
Support for CUDA streams
When writing CUDA applications, an important aspect for keeping GPUs busy is the use of streams to enqueue operations asynchronously from the host.
Libraries such as Numba and CuPy offer support for CUDA streams, but today we don't know to what extent they're functional with Dask.
I believe CUDA streams will be beneficial for getting higher performance: particularly in the case of many small operations, streams may help Dask keep dispatching work asynchronously while GPUs do the work.
We should check what the correct way of using them with Dask is, and whether/how they provide performance improvements.
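As a rough illustration (a minimal sketch using CuPy, not code from this issue), this is what enqueueing work on a non-default stream looks like; the host regains control immediately and only synchronizes when it actually needs the result:

```python
# Minimal sketch: enqueue CuPy work on a non-default stream so the host stays
# free to keep scheduling (e.g. Dask dispatching more tasks) while the GPU runs.
import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)  # a stream independent of the default stream

with stream:
    # Kernels launched inside this context are enqueued on `stream` and return
    # control to the host immediately.
    x = cp.random.random((4_000, 4_000))
    y = x @ x.T

# ... the host can do other work here ...

stream.synchronize()   # wait for the enqueued work to finish
print(float(y[0, 0]))  # safe to consume the result on the host now
```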
cc @mrocklin @jakirkham @jrhemstad
There is no correct way today. Every Python GPU library has its own Python wrapping of the CUDA streams API, so there is no consistent way to do this at the moment. Also cc'ing @kkraus14
FWIW, I recently stumbled across this bug ( https://github.com/cupy/cupy/issues/2159 ) on the CuPy issue tracker. Not sure if it is directly related, but figured it was worth being aware of.
Possibly related: numba/numba#4797
One potentially interesting idea to consider would be to create streams and store them in thread-local storage. This way we could grab the stream for the current thread and apply it to operations within that thread relatively easily. Ideally this would give us the benefits of compiling with --default-stream per-thread without necessarily having to do that. The question then is where this functionality should live: dask-cuda seems like a reasonable place, but we could also imagine Numba, CuPy, or others being reasonable candidates. Thoughts?
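As a hypothetical sketch of that idea (none of these names exist in dask-cuda today; CuPy is assumed for the stream object):

```python
# Hypothetical sketch: one CUDA stream per worker thread, stored in
# thread-local storage, so work submitted from different threads is
# enqueued on different streams.
import threading
import cupy as cp

_local = threading.local()

def get_thread_stream():
    """Return a CUDA stream unique to the calling thread, creating it on first use."""
    stream = getattr(_local, "stream", None)
    if stream is None:
        stream = cp.cuda.Stream(non_blocking=True)
        _local.stream = stream
    return stream

def run_on_thread_stream(fn, *args, **kwargs):
    """Run `fn` with this thread's stream set as CuPy's current stream."""
    with get_thread_stream():
        return fn(*args, **kwargs)
```

Worker threads could then wrap task execution in something like run_on_thread_stream, which is roughly what compiling with --default-stream per-thread would give for free.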
cc @kkraus14
I've never really used --default-stream per-thread. What's the behavior if you happen to access the same data from multiple threads? Do you need explicit synchronization? If so, how would that be handled automatically without requiring user intervention?
Just want to add that CuPy already does exactly what John said (thread-local streams), see https://github.com/cupy/cupy/blob/096957541c36d4f2a3f2a453e88456fffde1c8bf/cupy/cuda/stream.pyx#L8-L46.
For @pentschev's question: I think users who write a multi-threaded GPU program are responsible for taking care of synchronization, as they would in CUDA C/C++ (see the sketch after this comment); not sure if there's an easy way out.
By the way, a discussion on per-thread default streams was recently brought up in Numba: numba/numba#5137.
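To make that concrete, here is a minimal sketch (CuPy streams and events, not code from this thread) of the kind of explicit synchronization a user would be responsible for when one thread produces data and another consumes it:

```python
# Minimal sketch of explicit cross-stream synchronization with CuPy.
import cupy as cp

producer_stream = cp.cuda.Stream(non_blocking=True)
consumer_stream = cp.cuda.Stream(non_blocking=True)
ready = cp.cuda.Event()

# What the producer thread would do: enqueue its work, then record an event
# marking the point at which `data` is ready.
with producer_stream:
    data = cp.arange(1_000_000) ** 2
    ready.record(producer_stream)

# What the consumer thread would do: make its own stream wait on that event
# before enqueueing anything that reads `data`.
consumer_stream.wait_event(ready)
with consumer_stream:
    total = data.sum()
```

Without the event (or an explicit stream/device synchronize), the consumer's kernels could run before the producer's have finished.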
It seems like --default-stream per-thread could then be helpful if we assign multiple threads per worker (currently we only use one in dask-cuda). My original idea was to have the same thread working on multiple streams, but I think that may be too complex a use case for Dask. However, multiple threads per worker on the same GPU may limit the problem size due to a potentially larger memory footprint, so we need to really test this out and see whether it is something we could work with in Dask.
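For reference, a hedged sketch of that experiment (threads_per_worker is assumed to be forwarded by LocalCUDACluster; today the default is one thread per GPU worker):

```python
# Sketch: start a local dask-cuda cluster with more than one thread per GPU
# worker, so per-thread default streams (or thread-local streams) could
# overlap work on the same device.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster(threads_per_worker=2)  # 2 threads per GPU worker
    client = Client(cluster)
    print(client)
```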
Thanks for the info, Leo! So maybe this is less a question of how dask-cuda specifically should handle this, and more a question of how we should work together across libraries to achieve a shared stream solution.
I'd also note that --default-stream per-thread is currently incompatible with RMM pool mode; that is planned to be addressed in the future, but likely won't be for a while.
Updating this issue a bit. It is possible to enable --default-stream per-thread with RMM's default CNMeM pool ( https://github.com/rapidsai/rmm/pull/354 ), but it comes with some tradeoffs. There is also a new RMM pool for which --default-stream per-thread support is being worked on in PR ( https://github.com/rapidsai/rmm/pull/425 ).
I'm asking about PTDS support with CuPy in issue ( https://github.com/cupy/cupy/issues/3755 ).
We would also need to make a change to Distributed to support PTDS; I have submitted a draft PR ( https://github.com/dask/distributed/pull/4034 ) with those changes.
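For context, a hedged sketch of the RMM/CuPy setup that PTDS support has to cooperate with (rmm_cupy_allocator has moved between modules across RMM versions, so the exact import path may differ):

```python
# Sketch: create an RMM memory pool and route CuPy allocations through RMM.
# PTDS support in RMM/CuPy needs to work on top of a configuration like this.
import rmm
import cupy as cp

rmm.reinitialize(pool_allocator=True, initial_pool_size=2 ** 30)  # 1 GiB pool
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)  # CuPy allocations now come from the RMM pool
```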
Following up on the earlier note that --default-stream per-thread is incompatible with RMM pool mode: PR ( https://github.com/rapidsai/rmm/pull/466 ) will switch us to the new RMM pool, which does support PTDS.
I began with some testing and changes to enable PTDS support in https://github.com/rapidsai/rmm/pull/633 and https://github.com/cupy/cupy/pull/4322 .
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
There is active work here, though it's not really tracked in this issue, as a few libraries are involved.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This is ongoing work. See issue ( https://github.com/rapidsai/dask-cuda/issues/517 ) for a more up-to-date status.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.