dask-cuda
Temporarily disable compression for communication protocols
For GPU data, compression makes things worse rather than better because it provokes device-to-host transfers when they are unnecessary.
This is a short-term fix for #935, in lieu of hooking up GPU-based compression algorithms.
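For reference, a minimal sketch of the kind of one-line override this PR appears to introduce (the placement in `dask_cuda/__init__.py` is inferred from the coverage report below; the config key is Distributed's standard compression setting, and the exact mechanism used in the diff is an assumption):

```python
# Sketch: turn off Distributed's wire compression when Dask-CUDA is imported.
# `None` tells Distributed not to compress frames at all.
import dask

dask.config.set({"distributed.comm.compression": None})
```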
@charlesbluca @beckernick @VibhuJawa @ayushdg @randerzander just so you're aware of this, in case this shows up somehow in any workflows where you may test TCP performance. I believe this should make things faster for Dask-CUDA workflows and not have any negative impact.
Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here
> Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here
I can't really explain that. Is there any chance we're hitting another condition because TCP will force a D2H/H2D copy? Not sure if you've seen it, but this is the comment I left when I found that out: https://github.com/rapidsai/dask-cuda/issues/935#issuecomment-1161639986
TCP currently requires host to device copying regardless of whether there is compression or not. So disabling compression wouldn't fix that. If we have some way to send device objects over TCP, let's discuss how we can enable this in Distributed.
> TCP currently requires host to device copying regardless of whether there is compression or not. So disabling compression wouldn't fix that.
Yes, I know, and this is not just currently but will always be the case, as it needs to go over Ethernet and there's no GPUDirect RDMA in that case. But anyway, that is not what I was trying to say, apologies if it was unclear. I'm only imagining this is happening because it hits some other path instead of https://github.com/dask/distributed/blob/3551d1574c9cd72d60197cc84dd75702ebcfec54/distributed/protocol/cuda.py#L28 that you mentioned earlier. The thing is that `lz4` gets installed by default in Dask now, and that's what caused the behavior change, so I can only imagine that the CUDA-specific config is being ignored/overridden.
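One way to sanity-check which default is in effect (a sketch; it only inspects the Dask config and whether `lz4` is importable, which is what the `auto` setting keys off):

```python
# Inspect the effective compression setting. With "auto", Distributed picks
# lz4 when the package is importable, which is what changed once dask began
# pulling lz4 in by default.
import importlib.util

import dask
import distributed  # noqa: F401  (importing registers distributed's config defaults)

print(dask.config.get("distributed.comm.compression"))            # e.g. "auto"
print("lz4 importable:", importlib.util.find_spec("lz4") is not None)
```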
> Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here
My commit message might be misleading. I got lost trying to follow where the compression was coming from, so it might not be host/device copies, but rather just that on a fast-ish network compressing big chunks of data is slower than just sending them over the wire.
Gotcha ok that makes more sense. That said, users can still configure this themselves in those cases. Instead of having a different default (based on whether Distributed or Distributed + Dask-CUDA is used), which may be more confusing, why don't we document this and encourage users to explore different settings based on their needs?
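For example, a user who knows their network is fast could opt out per-workflow without any Dask-CUDA change (a sketch; the override needs to be in place before the cluster and its comms start, and the cluster arguments here are illustrative):

```python
# Disable wire compression for this workflow only, before the cluster starts.
# The same key can also be set via the DASK_DISTRIBUTED__COMM__COMPRESSION
# environment variable or a Dask YAML config file.
import dask
from dask_cuda import LocalCUDACluster
from distributed import Client

dask.config.set({"distributed.comm.compression": None})

cluster = LocalCUDACluster(protocol="tcp")
client = Client(cluster)
```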
> Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here
I'll have a look. Here are some profiled callgraphs of the two options (compression `auto` vs compression `None`):

[profiled callgraph images: compression auto vs compression None]
> Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here

> I'll have a look, here are some profiled callgraphs of the two options (compression auto vs compression None)
tl;dr: using the TCP protocol never calls `cuda_dumps`/`cuda_loads`, so it's not "GPU data" any more, and so those overrides don't kick in.

OK, so that explicit disabling only kicks in when `to_frames` is called with a `serializers` argument that includes `"cuda"`. This happens when the UCX comm backend is used, which explicitly sets `serializers` in `write` and `read`. In that case, `cuda_dumps` is called, which produces frames that contain device buffers (which UCX can handle).

In the TCP case, `serializers` is always `None` on call, so `tcp.write` calls `to_frames`, which calls into `dask_dumps` (for which there are handlers registered that deal with cudf dataframes and call `host_serialize`). But now the header of the message doesn't contain compression overrides, with the consequence that the host dataframe buffers are now compressed.
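A small way to see the two paths side by side (a sketch, assuming a GPU with cudf installed; it calls `distributed.protocol.serialize` directly with serializer lists that roughly mirror what the UCX and TCP comms ask for):

```python
import cudf
from distributed.protocol import serialize

df = cudf.DataFrame({"a": range(1000)})

# UCX-like call: "cuda" is offered first, so cuda_dumps handles the object and
# the resulting frames may contain device buffers.
header_ucx, frames_ucx = serialize(df, serializers=("cuda", "dask", "pickle"))
print(header_ucx["serializer"])   # expected: "cuda"

# TCP-like call: no "cuda" serializer, so dask_dumps/host_serialize run and the
# frames are plain host bytes that to_frames may then compress.
header_tcp, frames_tcp = serialize(df, serializers=("dask", "pickle"))
print(header_tcp["serializer"])   # expected: "dask"
```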
Right, though disabling compression won't avoid the DtH/HtD transfers in the TCP case.
Compression is allowed in that case since everything is on host. It just follows Distributed's default.
Certainly users can disable this behavior on their own. We could also add this to our own benchmark scripts (if that is important to us).
Would caution against setting this as a default in Dask-CUDA because
- It has different behavior than Distributed
- Users will not realize adding Dask-CUDA has changed this default somehow
- This may not be the right choice for other workflows
- It can lead to lengthy debugging by developers building on Dask-CUDA
Here's a recent example of this kind of debugging due to a custom environment variable conda-forge added to scikit-build ( https://github.com/conda-forge/ctng-compiler-activation-feedstock/issues/77 ) ( https://github.com/scikit-build/scikit-build/issues/722 ).
I find it particularly frustrating when defaults are hard to know about beforehand; the compression default itself is a great example, where something changed in Dask (pulling in lz4 by default) that had to be debugged before I could understand it. So here we would be setting yet another implicit default that may be difficult to debug too (now across two layers), so I agree with your point, John.
I'm ok with just setting that as a default for benchmark scripts, for example, if Lawrence is ok with that too.
rerun tests
Codecov Report
Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
Please upload report for BASE (branch-22.08@435dae8). Learn more about missing BASE report.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| dask_cuda/__init__.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.08 #957 +/- ##
==============================================
Coverage ? 0.00%
==============================================
Files ? 16
Lines ? 2107
Branches ? 0
==============================================
Hits ? 0
Misses ? 2107
Partials ? 0
This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.