dask-cuda
Performance regression in cuDF merge benchmark
Running the cuDF benchmark with RAPIDS 22.06 results in the following:
RAPIDS 22.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-16 08:21:54,375 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-16 08:21:54,382 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
20.70 s | 294.80 MiB/s
17.62 s | 346.49 MiB/s
39.32 s | 155.22 MiB/s
================================================================================
Throughput | 265.50 MiB +/- 80.79 MiB
Wall-Clock | 25.88 s +/- 9.59 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 110.55 MiB/s 153.32 MiB/s 187.99 MiB/s (12.85 GiB)
(02,01) | 147.30 MiB/s 173.17 MiB/s 187.13 MiB/s (12.85 GiB)
If we roll back one year, to RAPIDS 21.06, performance was substantially better:
RAPIDS 21.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock | Throughput
-------------------------------
15.40 s | 396.40 MiB/s
7.35 s | 830.55 MiB/s
8.80 s | 693.83 MiB/s
===============================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 325.82 MiB/s 332.85 MiB/s 351.81 MiB/s (12.85 GiB)
(02,01) | 296.46 MiB/s 321.66 MiB/s 333.66 MiB/s (12.85 GiB)
It isn't clear where this regression comes from; potential candidates include Distributed, cuDF, and Dask-CUDA itself.
RAPIDS 21.12 and 22.02 perform better than 21.06. The regression first appeared in 22.04; see results below.
RAPIDS 21.06 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock | Throughput
-------------------------------
8.28 s | 737.58 MiB/s
15.94 s | 382.80 MiB/s
16.27 s | 375.17 MiB/s
15.90 s | 383.76 MiB/s
15.54 s | 392.86 MiB/s
15.67 s | 389.50 MiB/s
15.52 s | 393.30 MiB/s
16.02 s | 381.04 MiB/s
7.72 s | 790.71 MiB/s
8.35 s | 730.57 MiB/s
===============================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 291.99 MiB/s 317.62 MiB/s 402.70 MiB/s (51.04 GiB)
(02,01) | 295.38 MiB/s 327.51 MiB/s 401.62 MiB/s (51.03 GiB)
RAPIDS 21.12 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock | Throughput
-------------------------------
5.91 s | 1.01 GiB/s
5.72 s | 1.04 GiB/s
11.16 s | 546.74 MiB/s
4.82 s | 1.24 GiB/s
4.87 s | 1.22 GiB/s
4.83 s | 1.23 GiB/s
5.72 s | 1.04 GiB/s
5.78 s | 1.03 GiB/s
5.76 s | 1.03 GiB/s
11.18 s | 546.06 MiB/s
===============================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 429.34 MiB/s 509.66 MiB/s 626.34 MiB/s (39.86 GiB)
(02,01) | 419.16 MiB/s 502.99 MiB/s 633.16 MiB/s (39.86 GiB)
RAPIDS 22.02 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
4.79 s | 1.24 GiB/s
10.99 s | 555.41 MiB/s
10.05 s | 607.36 MiB/s
10.29 s | 593.14 MiB/s
9.94 s | 614.12 MiB/s
10.37 s | 588.66 MiB/s
4.78 s | 1.25 GiB/s
5.71 s | 1.04 GiB/s
10.13 s | 602.58 MiB/s
4.69 s | 1.27 GiB/s
================================================================================
Throughput | 848.05 MiB +/- 317.55 MiB
Wall-Clock | 8.17 s +/- 2.62 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 428.54 MiB/s 478.27 MiB/s 562.20 MiB/s (48.80 GiB)
(02,01) | 440.11 MiB/s 513.94 MiB/s 562.02 MiB/s (48.80 GiB)
RAPIDS 22.04 cuDF benchmark - 10 iterations
$ python local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
2022-06-20 02:01:12,323 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-20 02:01:12,325 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
49.11 s | 124.29 MiB/s
48.77 s | 125.15 MiB/s
45.11 s | 135.31 MiB/s
44.94 s | 135.81 MiB/s
44.13 s | 138.30 MiB/s
40.67 s | 150.07 MiB/s
48.73 s | 125.25 MiB/s
15.46 s | 394.67 MiB/s
48.26 s | 126.47 MiB/s
44.82 s | 136.17 MiB/s
================================================================================
Throughput | 159.15 MiB +/- 78.88 MiB
Wall-Clock | 43.00 s +/- 9.53 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 94.22 MiB/s 144.61 MiB/s 176.84 MiB/s (55.51 GiB)
(02,01) | 108.26 MiB/s 126.38 MiB/s 144.91 MiB/s (55.51 GiB)
The cause of this behavior is compression. Dask 2022.3.0 (RAPIDS 22.04) depends on lz4, whereas Dask 2022.1.0 (RAPIDS 22.02) doesn't. Distributed defaults to distributed.comm.compression=auto, which ends up picking lz4 when available. Disabling compression entirely yields significantly better bandwidth (~5x) and a much lower total runtime (~10x).
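For completeness, the same thing can be done programmatically with dask.config; the LocalCUDACluster setup below is just an illustrative sketch, equivalent to the DASK_DISTRIBUTED__COMM__COMPRESSION=None environment variable used in the runs below:

# Equivalent to DASK_DISTRIBUTED__COMM__COMPRESSION=None; must run
# before the cluster is created so the workers pick it up.
import dask

dask.config.set({"distributed.comm.compression": None})

from dask_cuda import LocalCUDACluster
from distributed import Client

# Same devices as the benchmark invocations in this issue.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="1,2")
client = Client(cluster)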
RAPIDS 22.04 (no compression)
$ DASK_DISTRIBUTED__COMM__COMPRESSION=None python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:35:28,295 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:35:28,298 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
3.96 s | 1.51 GiB/s
4.04 s | 1.47 GiB/s
3.95 s | 1.51 GiB/s
================================================================================
Throughput | 1.50 GiB +/- 16.61 MiB
Wall-Clock | 3.98 s +/- 43.50 ms
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 655.04 MiB/s 707.73 MiB/s 717.14 MiB/s (10.62 GiB)
(02,01) | 700.51 MiB/s 778.48 MiB/s 819.70 MiB/s (10.62 GiB)
RAPIDS 22.04 (lz4)
$ DASK_DISTRIBUTED__COMM__COMPRESSION=lz4 python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:22:57,556 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:22:57,558 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
41.16 s | 148.28 MiB/s
44.95 s | 135.77 MiB/s
45.06 s | 135.44 MiB/s
================================================================================
Throughput | 139.83 MiB +/- 5.97 MiB
Wall-Clock | 43.73 s +/- 1.81 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 124.73 MiB/s 151.97 MiB/s 169.86 MiB/s (17.32 GiB)
(02,01) | 115.92 MiB/s 132.03 MiB/s 144.96 MiB/s (17.32 GiB)
@quasiben @jakirkham do you have any ideas or suggestions on the best way to handle this? It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.
Good catch @pentschev!
It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.
I agree, we should disable compression by default for now. If we want to make compression available, we could use KvikIO's Python bindings to nvCOMP.
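For reference, a rough sketch of what plugging such a codec into Distributed could look like, using Distributed's internal compression registry; note that this registry is not a stable public API, and nvcomp_compress/nvcomp_decompress are hypothetical placeholders for KvikIO/nvCOMP-backed implementations:

# Illustrative only: the compressions registry is internal to Distributed
# and may change between versions. Registration would need to run on the
# scheduler and every worker (e.g. through a preload module).
import dask
from distributed.protocol import compression

def nvcomp_compress(data):
    raise NotImplementedError("would call KvikIO's nvCOMP bindings")

def nvcomp_decompress(data):
    raise NotImplementedError("would call KvikIO's nvCOMP bindings")

compression.compressions["nvcomp"] = {
    "compress": nvcomp_compress,
    "decompress": nvcomp_decompress,
}
dask.config.set({"distributed.comm.compression": "nvcomp"})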
That is a good idea @madsbk, is this something we plan to add to Distributed? It would be good to do that and do some testing/profiling.
A short-term fix disabling compression is in #957.