dask-cuda
Performance regression in cuDF merge benchmark
Running the cuDF benchmark with RAPIDS 22.06 results in the following:
RAPIDS 22.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-16 08:21:54,375 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-16 08:21:54,382 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
20.70 s | 294.80 MiB/s
17.62 s | 346.49 MiB/s
39.32 s | 155.22 MiB/s
================================================================================
Throughput | 265.50 MiB +/- 80.79 MiB
Wall-Clock | 25.88 s +/- 9.59 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 110.55 MiB/s 153.32 MiB/s 187.99 MiB/s (12.85 GiB)
(02,01) | 147.30 MiB/s 173.17 MiB/s 187.13 MiB/s (12.85 GiB)
If we roll back one year, to RAPIDS 21.06, performance was substantially better:
RAPIDS 21.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock | Throughput
-------------------------------
15.40 s | 396.40 MiB/s
7.35 s | 830.55 MiB/s
8.80 s | 693.83 MiB/s
===============================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 325.82 MiB/s 332.85 MiB/s 351.81 MiB/s (12.85 GiB)
(02,01) | 296.46 MiB/s 321.66 MiB/s 333.66 MiB/s (12.85 GiB)
It isn't clear where this regression comes from; potential candidates include Distributed, cuDF, and Dask-CUDA itself.
RAPIDS 21.12 and 22.02 perform better than 21.06. The regression first appeared in 22.04; see results below.
RAPIDS 21.06 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock | Throughput
-------------------------------
8.28 s | 737.58 MiB/s
15.94 s | 382.80 MiB/s
16.27 s | 375.17 MiB/s
15.90 s | 383.76 MiB/s
15.54 s | 392.86 MiB/s
15.67 s | 389.50 MiB/s
15.52 s | 393.30 MiB/s
16.02 s | 381.04 MiB/s
7.72 s | 790.71 MiB/s
8.35 s | 730.57 MiB/s
===============================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 291.99 MiB/s 317.62 MiB/s 402.70 MiB/s (51.04 GiB)
(02,01) | 295.38 MiB/s 327.51 MiB/s 401.62 MiB/s (51.03 GiB)
RAPIDS 21.12 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock | Throughput
-------------------------------
5.91 s | 1.01 GiB/s
5.72 s | 1.04 GiB/s
11.16 s | 546.74 MiB/s
4.82 s | 1.24 GiB/s
4.87 s | 1.22 GiB/s
4.83 s | 1.23 GiB/s
5.72 s | 1.04 GiB/s
5.78 s | 1.03 GiB/s
5.76 s | 1.03 GiB/s
11.18 s | 546.06 MiB/s
===============================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 429.34 MiB/s 509.66 MiB/s 626.34 MiB/s (39.86 GiB)
(02,01) | 419.16 MiB/s 502.99 MiB/s 633.16 MiB/s (39.86 GiB)
RAPIDS 22.02 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
4.79 s | 1.24 GiB/s
10.99 s | 555.41 MiB/s
10.05 s | 607.36 MiB/s
10.29 s | 593.14 MiB/s
9.94 s | 614.12 MiB/s
10.37 s | 588.66 MiB/s
4.78 s | 1.25 GiB/s
5.71 s | 1.04 GiB/s
10.13 s | 602.58 MiB/s
4.69 s | 1.27 GiB/s
================================================================================
Throughput | 848.05 MiB +/- 317.55 MiB
Wall-Clock | 8.17 s +/- 2.62 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 428.54 MiB/s 478.27 MiB/s 562.20 MiB/s (48.80 GiB)
(02,01) | 440.11 MiB/s 513.94 MiB/s 562.02 MiB/s (48.80 GiB)
RAPIDS 22.04 cuDF benchmark - 10 iterations
$ python local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
2022-06-20 02:01:12,323 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-20 02:01:12,325 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
49.11 s | 124.29 MiB/s
48.77 s | 125.15 MiB/s
45.11 s | 135.31 MiB/s
44.94 s | 135.81 MiB/s
44.13 s | 138.30 MiB/s
40.67 s | 150.07 MiB/s
48.73 s | 125.25 MiB/s
15.46 s | 394.67 MiB/s
48.26 s | 126.47 MiB/s
44.82 s | 136.17 MiB/s
================================================================================
Throughput | 159.15 MiB +/- 78.88 MiB
Wall-Clock | 43.00 s +/- 9.53 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 94.22 MiB/s 144.61 MiB/s 176.84 MiB/s (55.51 GiB)
(02,01) | 108.26 MiB/s 126.38 MiB/s 144.91 MiB/s (55.51 GiB)
The cause of this behavior is compression. Dask 2022.3.0 (RAPIDS 22.04) depends on lz4, whereas Dask 2022.1.0 (RAPIDS 22.02) doesn't. Distributed defaults to distributed.comm.compression=auto, which ends up picking lz4 when available. Disabling compression entirely yields significantly better bandwidth (~5x) and a much lower total runtime (~10x).
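For completeness, the same thing can be done programmatically with dask.config; the LocalCUDACluster setup below is just an illustrative sketch, equivalent to the DASK_DISTRIBUTED__COMM__COMPRESSION=None environment variable used in the runs below:

# Equivalent to DASK_DISTRIBUTED__COMM__COMPRESSION=None; must run
# before the cluster is created so the workers pick it up.
import dask

dask.config.set({"distributed.comm.compression": None})

from dask_cuda import LocalCUDACluster
from distributed import Client

# Same devices as the benchmark invocations in this issue.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="1,2")
client = Client(cluster)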
RAPIDS 22.04 (no compression)
$ DASK_DISTRIBUTED__COMM__COMPRESSION=None python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:35:28,295 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:35:28,298 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
3.96 s | 1.51 GiB/s
4.04 s | 1.47 GiB/s
3.95 s | 1.51 GiB/s
================================================================================
Throughput | 1.50 GiB +/- 16.61 MiB
Wall-Clock | 3.98 s +/- 43.50 ms
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 655.04 MiB/s 707.73 MiB/s 717.14 MiB/s (10.62 GiB)
(02,01) | 700.51 MiB/s 778.48 MiB/s 819.70 MiB/s (10.62 GiB)
RAPIDS 22.04 (lz4)
$ DASK_DISTRIBUTED__COMM__COMPRESSION=lz4 python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:22:57,556 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:22:57,558 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
41.16 s | 148.28 MiB/s
44.95 s | 135.77 MiB/s
45.06 s | 135.44 MiB/s
================================================================================
Throughput | 139.83 MiB +/- 5.97 MiB
Wall-Clock | 43.73 s +/- 1.81 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 124.73 MiB/s 151.97 MiB/s 169.86 MiB/s (17.32 GiB)
(02,01) | 115.92 MiB/s 132.03 MiB/s 144.96 MiB/s (17.32 GiB)
@quasiben @jakirkham do you have any ideas or suggestions on the best way to handle this? It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.
Good catch @pentschev!
It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.
I agree, we should disable compression by default for now. If we want to make compression available, we could use KvikIO's Python bindings to nvCOMP.
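For reference, a rough sketch of what plugging such a codec into Distributed could look like, using Distributed's internal compression registry; note that this registry is not a stable public API, and nvcomp_compress/nvcomp_decompress are hypothetical placeholders for KvikIO/nvCOMP-backed implementations:

# Illustrative only: the compressions registry is internal to Distributed
# and may change between versions. Registration would need to run on the
# scheduler and every worker (e.g. through a preload module).
import dask
from distributed.protocol import compression

def nvcomp_compress(data):
    raise NotImplementedError("would call KvikIO's nvCOMP bindings")

def nvcomp_decompress(data):
    raise NotImplementedError("would call KvikIO's nvCOMP bindings")

compression.compressions["nvcomp"] = {
    "compress": nvcomp_compress,
    "decompress": nvcomp_decompress,
}
dask.config.set({"distributed.comm.compression": "nvcomp"})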
That is a good idea @madsbk, is this something we plan to add to Distributed? It would be good to do that and do some testing/profiling.
A short-term fix disabling compression is in #957.