Performance regression in CUDA executor’s callback mode
There is a performance regression in the callback mode of the CUDA executor, introduced by commit
b5a7913aad (Fix check in unregister_thread and add register bool, 2020-10-14).
Expected Behavior
With the preceding commit 02cbcf5413 (Fix hipcc bug of returning const ref when accessing non-const array, 2020-09-30), the executor performance benchmark produces the following results.
$ bin/cuda_executor_throughput_test
[HPX CUBLAS executor benchmark] - Starting...
GPU Device 0: "GeForce RTX 3080" with compute capability 8
MatrixA(4,4), MatrixB(4,4), MatrixC(4,4)
Async host->device copy operation completed
Small matrix multiply tests using CUBLAS...
us per iteration 40.23 : Warmup
us per iteration 32.3667 : Callback based executor
us per iteration 27.0667 : Event polling based executor
Note that the runtime of each iteration does not depend on the total iteration count.
$ bin/cuda_executor_throughput_test --iterations=10000
[HPX CUBLAS executor benchmark] - Starting...
GPU Device 0: "GeForce RTX 3080" with compute capability 8
MatrixA(4,4), MatrixB(4,4), MatrixC(4,4)
Async host->device copy operation completed
Small matrix multiply tests using CUBLAS...
us per iteration 35.19 : Warmup
us per iteration 29.3334 : Callback based executor
us per iteration 16.5664 : Event polling based executor
Actual Behavior
Starting with commit b5a7913aad (Fix check in unregister_thread and add register bool, 2020-10-14), the runtime of the callback-based executor is about three times as long. The runtime of the event-based executor does not appear to be affected.
$ bin/cuda_executor_throughput_test
[HPX CUBLAS executor benchmark] - Starting...
GPU Device 0: "GeForce RTX 3080" with compute capability 8
MatrixA(4,4), MatrixB(4,4), MatrixC(4,4)
Async host->device copy operation completed
Small matrix multiply tests using CUBLAS...
us per iteration 112.14 : Warmup
us per iteration 99.7333 : Callback based executor
us per iteration 24.1333 : Event polling based executor
Note that the runtime of the callback mode depends on the total iteration count: with more iterations, the average time per iteration increases.
$ bin/cuda_executor_throughput_test --iterations=10000
[HPX CUBLAS executor benchmark] - Starting...
GPU Device 0: "GeForce RTX 3080" with compute capability 8
MatrixA(4,4), MatrixB(4,4), MatrixC(4,4)
Async host->device copy operation completed
Small matrix multiply tests using CUBLAS...
us per iteration 115.61 : Warmup
us per iteration 268.099 : Callback based executor
us per iteration 16.6439 : Event polling based executor
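The slowdown factor can be extracted mechanically from the benchmark output. Below is a small helper sketch; the timing lines are copied from the default-iteration runs above, and the parsing assumes the exact `us per iteration <value> : <label>` format printed by the benchmark:

```python
import re

LINE = re.compile(r"us per iteration\s+([\d.]+)\s+:\s+(.*)")

def parse(log: str) -> dict:
    """Map benchmark label -> microseconds per iteration."""
    return {m.group(2).strip(): float(m.group(1))
            for m in map(LINE.match, log.splitlines()) if m}

# Timing lines copied from the runs above (before: 02cbcf5413, after: b5a7913aad).
before = parse("""us per iteration 32.3667 : Callback based executor
us per iteration 27.0667 : Event polling based executor""")
after = parse("""us per iteration 99.7333 : Callback based executor
us per iteration 24.1333 : Event polling based executor""")

# Print the per-mode slowdown factor (after / before).
for label in before:
    print(f"{label}: {after[label] / before[label]:.2f}x")
```

The callback-based executor comes out at roughly a 3x slowdown, consistent with the description above, while the event-polling ratio stays near 1x.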
Steps to Reproduce the Problem
Compare the runtimes of the CUDA executor between commits 02cbcf5413 and b5a7913aad. Assuming the build directory is the working directory, the following steps reproduce the issue.
1. git checkout 02cbcf5413
2. make tests.performance.modules.async_cuda
3. bin/cuda_executor_throughput_test --iterations=10000 | tee $(git describe)
4. git checkout b5a7913aad
5. make tests.performance.modules.async_cuda
6. bin/cuda_executor_throughput_test --iterations=10000 | tee $(git describe)
7. diff --color --unified 1.5.0-275-g02cbcf5413 1.5.0-276-gb5a7913aad
Note that HPX_WITH_ASYNC_CUDA and HPX_WITH_CUDA are assumed to be ON.
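Since the tee'd logs also differ in unrelated lines (e.g. the warmup timing), it can help to restrict the comparison to the timing lines. A sketch, using stand-in files named after the `git describe` output of the two commits, with contents reduced to timing lines copied from the runs above:

```shell
# Stand-in log files named like the tee'd benchmark logs; real runs would
# contain the full benchmark output, these hold only one timing line each.
printf 'us per iteration 29.3334 : Callback based executor\n' > 1.5.0-275-g02cbcf5413
printf 'us per iteration 268.099 : Callback based executor\n' > 1.5.0-276-gb5a7913aad

# Compare only the per-iteration timings, ignoring banner and warmup lines.
grep -H '^us per iteration' 1.5.0-275-g02cbcf5413 1.5.0-276-gb5a7913aad
```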
Specifications
CUDA 11.2.2 is used.
The regression is still present at the recent commit 98477a6ac7 (Merge pull request #5348 from STEllAR-GROUP/rename_tag_invoke, 2021-06-01).
$ git describe
1.6.0-644-g98477a6ac7
The OS is Linux, specifically
$ uname -a
Linux … 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
The GCC version is
$ gcc --version
gcc (GCC) 10.2.0
Since https://github.com/STEllAR-GROUP/hpx/pull/5383 was merged, this should be fixed. @DarkDeepBlue any chance you could confirm if that is the case?
@DarkDeepBlue is this still an issue for you?