Performance regression in CUDA executor’s callback mode
There is a performance regression in the callback mode of the CUDA executor, introduced by commit
b5a7913aad (Fix check in unregister_thread and add register bool, 2020-10-14).
Expected Behavior
With the preceding commit 02cbcf5413 (Fix hipcc bug of returning const ref when accessing non-const array, 2020-09-30), the executor performance benchmark produces the following results.
$ bin/cuda_executor_throughput_test
[HPX CUBLAS executor benchmark] - Starting...
GPU Device 0: "GeForce RTX 3080" with compute capability 8
MatrixA(4,4), MatrixB(4,4), MatrixC(4,4)
Async host->device copy operation completed
Small matrix multiply tests using CUBLAS...
us per iteration 40.23 : Warmup
us per iteration 32.3667 : Callback based executor
us per iteration 27.0667 : Event polling based executor
Note that the runtime of each iteration does not depend on the total iteration count.
$ bin/cuda_executor_throughput_test --iterations=10000
[HPX CUBLAS executor benchmark] - Starting...
GPU Device 0: "GeForce RTX 3080" with compute capability 8
MatrixA(4,4), MatrixB(4,4), MatrixC(4,4)
Async host->device copy operation completed
Small matrix multiply tests using CUBLAS...
us per iteration 35.19 : Warmup
us per iteration 29.3334 : Callback based executor
us per iteration 16.5664 : Event polling based executor
Actual Behavior
Starting with commit b5a7913aad (Fix check in unregister_thread and add register bool, 2020-10-14), the runtime of the callback-based executor is about three times as long. The runtime of the event-based executor does not appear to be affected.
$ bin/cuda_executor_throughput_test
[HPX CUBLAS executor benchmark] - Starting...
GPU Device 0: "GeForce RTX 3080" with compute capability 8
MatrixA(4,4), MatrixB(4,4), MatrixC(4,4)
Async host->device copy operation completed
Small matrix multiply tests using CUBLAS...
us per iteration 112.14 : Warmup
us per iteration 99.7333 : Callback based executor
us per iteration 24.1333 : Event polling based executor
Note that the runtime of the callback mode depends on the total iteration count: with more iterations, the average time per iteration increases.
$ bin/cuda_executor_throughput_test --iterations=10000
[HPX CUBLAS executor benchmark] - Starting...
GPU Device 0: "GeForce RTX 3080" with compute capability 8
MatrixA(4,4), MatrixB(4,4), MatrixC(4,4)
Async host->device copy operation completed
Small matrix multiply tests using CUBLAS...
us per iteration 115.61 : Warmup
us per iteration 268.099 : Callback based executor
us per iteration 16.6439 : Event polling based executor
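The slowdown factor can be extracted mechanically from the benchmark output. Below is a small helper sketch; the timing lines are copied from the default-iteration runs above, and the parsing assumes the exact `us per iteration <value> : <label>` format printed by the benchmark:

```python
import re

LINE = re.compile(r"us per iteration\s+([\d.]+)\s+:\s+(.*)")

def parse(log: str) -> dict:
    """Map benchmark label -> microseconds per iteration."""
    return {m.group(2).strip(): float(m.group(1))
            for m in map(LINE.match, log.splitlines()) if m}

# Timing lines copied from the runs above (before: 02cbcf5413, after: b5a7913aad).
before = parse("""us per iteration 32.3667 : Callback based executor
us per iteration 27.0667 : Event polling based executor""")
after = parse("""us per iteration 99.7333 : Callback based executor
us per iteration 24.1333 : Event polling based executor""")

# Print the per-mode slowdown factor (after / before).
for label in before:
    print(f"{label}: {after[label] / before[label]:.2f}x")
```

The callback-based executor comes out at roughly a 3x slowdown, consistent with the description above, while the event-polling ratio stays near 1x.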
Steps to Reproduce the Problem
Compare the runtimes of the CUDA executor between commits 02cbcf5413 and b5a7913aad. Assuming the build directory is the working directory, the following steps reproduce the issue.
1. git checkout 02cbcf5413
2. make tests.performance.modules.async_cuda
3. bin/cuda_executor_throughput_test --iterations=10000 | tee $(git describe)
4. git checkout b5a7913aad
5. make tests.performance.modules.async_cuda
6. bin/cuda_executor_throughput_test --iterations=10000 | tee $(git describe)
7. diff --color --unified 1.5.0-275-g02cbcf5413 1.5.0-276-gb5a7913aad
Note that HPX_WITH_ASYNC_CUDA and HPX_WITH_CUDA are assumed to be ON.
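Since the tee'd logs also differ in unrelated lines (e.g. the warmup timing), it can help to restrict the comparison to the timing lines. A sketch, using stand-in files named after the `git describe` output of the two commits, with contents reduced to timing lines copied from the runs above:

```shell
# Stand-in log files named like the tee'd benchmark logs; real runs would
# contain the full benchmark output, these hold only one timing line each.
printf 'us per iteration 29.3334 : Callback based executor\n' > 1.5.0-275-g02cbcf5413
printf 'us per iteration 268.099 : Callback based executor\n' > 1.5.0-276-gb5a7913aad

# Compare only the per-iteration timings, ignoring banner and warmup lines.
grep -H '^us per iteration' 1.5.0-275-g02cbcf5413 1.5.0-276-gb5a7913aad
```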
Specifications
CUDA 11.2.2 is used.
The regression is still present at the recent commit 98477a6ac7 (Merge pull request #5348 from STEllAR-GROUP/rename_tag_invoke, 2021-06-01).
$ git describe
1.6.0-644-g98477a6ac7
The OS is Linux, specifically
$ uname -a
Linux … 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
The GCC version is
$ gcc --version
gcc (GCC) 10.2.0
Since https://github.com/STEllAR-GROUP/hpx/pull/5383 was merged, this should be fixed. @DarkDeepBlue any chance you could confirm if that is the case?
@DarkDeepBlue is this still an issue for you?