HIP-CPU
Remove incorrect kernel optimization.
Not encountering a barrier in the first block does not guarantee that no barrier will be encountered in subsequent blocks. Barrier semantics only require all threads within a block to reach the same barriers; different blocks may legitimately take different paths.
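The removed optimization can be modelled, hypothetically, as a single probe of block 0 whose answer is then applied to every block. The names below (`probe_says_no_barriers`, `mishandled_blocks`) are illustrative and are not HIP-CPU's API; the sketch just lists which blocks such a probe would run without barrier support for a given per-block barrier pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: uses_barrier[b] says whether the threads of block b
// execute a barrier. Barrier semantics only require this to be uniform
// *within* a block, so the flags may differ between blocks.

// Modelled (removed) heuristic: probe block 0 only and assume the result
// holds for every block of the launch.
bool probe_says_no_barriers(const std::vector<bool>& uses_barrier) {
    assert(!uses_barrier.empty());
    return !uses_barrier.at(0);
}

// Blocks that would be executed without barrier support although they need it.
std::vector<std::size_t> mishandled_blocks(const std::vector<bool>& uses_barrier) {
    std::vector<std::size_t> bad;
    if (!probe_says_no_barriers(uses_barrier)) {
        return bad;  // probe saw a barrier, so every block gets barrier support
    }
    for (std::size_t b = 1; b < uses_barrier.size(); ++b) {
        if (uses_barrier[b]) {
            bad.push_back(b);
        }
    }
    return bad;
}
```

For the pattern `{false, false, true, false}` (only block 2 synchronizes) the probe reports "no barriers" and block 2 is mis-handled, even though the kernel is perfectly valid.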
The included test shows a simple kernel that satisfies the standard barrier semantics but crashes with the current HIP-CPU implementation.
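A kernel of this kind can be sketched in plain C++. This is a model under stated assumptions, not the test added in this PR: HIP threads are mapped to `std::thread`, shared memory to a `std::vector`, and `__syncthreads()` to a hand-written `BlockBarrier` (all hypothetical names). Odd blocks exchange values through shared memory and therefore synchronize; even blocks do not. That is legal, because every thread of a given block takes the same path.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Simple reusable barrier for the threads of one block (illustrative only;
// this is not how HIP-CPU implements __syncthreads()).
class BlockBarrier {
public:
    explicit BlockBarrier(std::size_t count) : count_(count) { assert(count > 0); }

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t generation = generation_;
        if (++arrived_ == count_) {  // last thread to arrive releases the others
            arrived_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
    std::size_t arrived_ = 0;
    std::size_t generation_ = 0;
};

// Hypothetical kernel: the barrier is reached by all threads of a block or
// by none of them, so standard barrier semantics hold.
void kernel(std::size_t block, std::size_t thread, std::vector<int>& shared,
            BlockBarrier& barrier, std::vector<int>& out) {
    const std::size_t n = shared.size();
    shared[thread] = static_cast<int>(block * 100 + thread);
    if (block % 2 == 1) {
        barrier.arrive_and_wait();                           // all writes to shared are visible
        out[block * n + thread] = shared[(thread + 1) % n];  // read the neighbour's value
    } else {
        out[block * n + thread] = shared[thread];            // own value only, no barrier
    }
}

// Runs each block with one OS thread per HIP thread, so a barrier can be
// satisfied in any block, not only in blocks that behave like block 0.
std::vector<int> launch(std::size_t blocks, std::size_t threads_per_block) {
    std::vector<int> out(blocks * threads_per_block);
    for (std::size_t b = 0; b < blocks; ++b) {
        std::vector<int> shared(threads_per_block);
        BlockBarrier barrier(threads_per_block);
        std::vector<std::thread> workers;
        for (std::size_t t = 0; t < threads_per_block; ++t) {
            workers.emplace_back(kernel, b, t, std::ref(shared), std::ref(barrier),
                                 std::ref(out));
        }
        for (auto& w : workers) {
            w.join();
        }
    }
    return out;
}
```

With `launch(2, 4)`, block 0 writes `0, 1, 2, 3` and block 1 writes `101, 102, 103, 100`. A probe of block 0 alone would see no barrier here, which is exactly the situation the removed optimization could not handle.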
Btw... this actually even improves performance (at least for all kernels in the performance tests), since until now all other blocks had to wait for the first block to determine whether it hits a barrier before they could be executed in parallel...
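The scheduling cost of that wait can be estimated with an idealized model (an assumption, not a measurement of HIP-CPU): every block takes one time unit and `workers` blocks can run concurrently. Probing block 0 first puts one serial round in front of the parallel part.

```cpp
#include <cassert>
#include <cstddef>

// Integer ceiling division for positive divisors.
std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Modelled old scheme: block 0 runs alone (the probe), then the remaining
// blocks run in parallel rounds of `workers` blocks.
std::size_t makespan_with_probe(std::size_t blocks, std::size_t workers) {
    assert(blocks > 0 && workers > 0);
    return 1 + ceil_div(blocks - 1, workers);
}

// Modelled new scheme: all blocks are scheduled in parallel from the start.
std::size_t makespan_without_probe(std::size_t blocks, std::size_t workers) {
    assert(blocks > 0 && workers > 0);
    return ceil_div(blocks, workers);
}
```

For 8 blocks on 4 workers this gives 3 rounds with the probe versus 2 without; for 1000 blocks on 8 workers, 126 versus 125. In this model the probe costs at most one block's run time, so its relative cost is largest when a launch has few blocks per worker.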
Main:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
performance_tests is a Catch v2.13.6 host application.
Run with -? for options
-------------------------------------------------------------------------------
Monte-Carlo PI
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:70
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU 100 1 8.30728 s
76.71 ms 76.0763 ms 77.4913 ms
3.58548 ms 3.01746 ms 4.36021 ms
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:77: FAILED:
CHECK( PI == (static_cast<double>(n) / niter) * 4.0 )
with expansion:
Approx( 3.1415926536 ) == 3.1413932
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
HIP-CPU 100 1 1.65979 s
16.5554 ms 16.2211 ms 17.0178 ms
1.98111 ms 1.50049 ms 2.79446 ms
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:96: FAILED:
CHECK( PI == (static_cast<double>(n) / niter) * 4.0 )
with expansion:
Approx( 3.1415926536 ) == 3.14065
-------------------------------------------------------------------------------
VADD
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:121
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU 100 1 34.2286 s
291.258 ms 289.239 ms 294.018 ms
11.9337 ms 9.09765 ms 15.9826 ms
HIP-CPU 100 1 29.2442 s
277.307 ms 276.274 ms 278.515 ms
5.67372 ms 4.83868 ms 7.02215 ms
-------------------------------------------------------------------------------
SGEMM
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:278
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU 100 1 14.6344 s
125.54 ms 124.342 ms 126.976 ms
6.64994 ms 5.69296 ms 7.73498 ms
HIP-CPU 100 1 1.24133 m
772.25 ms 764.184 ms 782.212 ms
45.2482 ms 37.4027 ms 62.551 ms
-------------------------------------------------------------------------------
N-Body
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:475
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU 100 1 1.98585 m
1.1932 s 1.1826 s 1.21194 s
70.305 ms 45.3548 ms 106.215 ms
HIP-CPU 0 100 1 25.2277 s
263.542 ms 259.725 ms 267.309 ms
19.3825 ms 17.5549 ms 21.6322 ms
HIP-CPU 1 100 1 26.6871 s
231.13 ms 229.313 ms 233.907 ms
11.2776 ms 8.36791 ms 19.0644 ms
HIP-CPU 2 100 1 1.15331 m
696.486 ms 693.559 ms 699.582 ms
15.3277 ms 13.3719 ms 18.0437 ms
-------------------------------------------------------------------------------
N-Queens
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:865
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU - Naive 100 87679 0 ns
2.48727 ns 2.47303 ns 2.50961 ns
0.0892655 ns 0.0643914 ns 0.128073 ns
CPU - Parallel 100 1 53.4986 ms
520.458 us 508.051 us 534.305 us
66.9775 us 58.8066 us 78.5709 us
GPU - Parallel 100 1 26.3587 ms
240.73 us 228.69 us 259.964 us
76.3305 us 53.3468 us 116.112 us
GPU - Optimised 100 1 28.7426 ms
227.102 us 219.844 us 240.266 us
48.6692 us 30.4105 us 83.7005 us
-------------------------------------------------------------------------------
HAXPY
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:910
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
HAXPY 100 1 10.8723 ms
87.6069 us 84.231 us 92.8906 us
21.1628 us 14.9325 us 29.7977 us
HAXPY-native 100 1 20.4512 ms
166.409 us 162.241 us 171.686 us
23.7347 us 19.2929 us 31.2847 us
===============================================================================
test cases: 6 | 5 passed | 1 failed
assertions: 26956 | 26954 passed | 2 failed
This patch:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
performance_tests is a Catch v2.13.6 host application.
Run with -? for options
-------------------------------------------------------------------------------
Monte-Carlo PI
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:70
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU 100 1 7.32708 s
75.6154 ms 74.9308 ms 76.4694 ms
3.89684 ms 3.2427 ms 4.67625 ms
HIP-CPU 100 1 1.67327 s
14.9517 ms 14.5308 ms 15.3304 ms
2.03369 ms 1.75302 ms 2.46693 ms
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:96: FAILED:
CHECK( PI == (static_cast<double>(n) / niter) * 4.0 )
with expansion:
Approx( 3.1415926536 ) == 3.1416468
-------------------------------------------------------------------------------
VADD
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:121
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU 100 1 34.1128 s
301.186 ms 298.573 ms 304.184 ms
14.2489 ms 12.3201 ms 16.7271 ms
HIP-CPU 100 1 27.8238 s
272.3 ms 271.602 ms 273.108 ms
3.80903 ms 3.21205 ms 4.78626 ms
-------------------------------------------------------------------------------
SGEMM
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:278
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU 100 1 14.8577 s
126.228 ms 125.291 ms 127.483 ms
5.50916 ms 4.47272 ms 7.54921 ms
HIP-CPU 100 1 1.0842 m
656.881 ms 651.239 ms 663.306 ms
30.6308 ms 26.1941 ms 36.3297 ms
-------------------------------------------------------------------------------
N-Body
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:475
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU 100 1 2.03542 m
1.14952 s 1.14056 s 1.16179 s
53.18 ms 41.5814 ms 70.4657 ms
HIP-CPU 0 100 1 24.0217 s
221.189 ms 220.5 ms 222.069 ms
3.94917 ms 3.30045 ms 5.08226 ms
HIP-CPU 1 100 1 20.8636 s
210.983 ms 210.021 ms 212.377 ms
5.84877 ms 4.28386 ms 8.19364 ms
HIP-CPU 2 100 1 1.11553 m
658.155 ms 654.364 ms 663.956 ms
23.6414 ms 16.9554 ms 35.3805 ms
-------------------------------------------------------------------------------
N-Queens
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:865
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
CPU - Naive 100 93908 0 ns
2.15085 ns 2.14657 ns 2.15887 ns
0.0288525 ns 0.0188323 ns 0.0493083 ns
CPU - Parallel 100 1 53.3928 ms
486.055 us 473.915 us 503.213 us
72.6388 us 55.2343 us 107.727 us
GPU - Parallel 100 1 24.9036 ms
196.201 us 191.615 us 200.837 us
23.4965 us 20.8985 us 26.7478 us
GPU - Optimised 100 1 26.7999 ms
207.862 us 202.519 us 213.795 us
28.682 us 24.8298 us 35.9623 us
-------------------------------------------------------------------------------
HAXPY
-------------------------------------------------------------------------------
/home/joachimm/Projekte/HIP-CPU/tests/benchmarks.cpp:910
...............................................................................
benchmark name samples iterations estimated
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
HAXPY 100 1 7.983 ms
71.2243 us 67.5569 us 76.9837 us
23.1313 us 15.8098 us 32.537 us
HAXPY-native 100 1 13.5236 ms
112.074 us 107.964 us 117.169 us
23.1151 us 19.0588 us 28.5149 us
===============================================================================
test cases: 6 | 5 passed | 1 failed
assertions: 26956 | 26955 passed | 1 failed
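To put the two runs side by side, the sketch below copies the mean times of the kernel rows from the logs above (main vs. this patch) and computes a speedup per row. Each pair shares a unit (ms or us), so the ratio is unit-free. These are single runs on one machine: the plain CPU baselines, which presumably do not go through the changed code path, also moved by a few percent (e.g. VADD CPU 291.258 ms vs. 301.186 ms), so differences of that size are likely noise.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Mean run time of one benchmark row in both logs (same unit within a row).
struct Result {
    std::string benchmark;
    double main_mean;
    double patch_mean;
};

// Values copied from the "Main" and "This patch" logs.
const std::vector<Result> results{
    {"Monte-Carlo PI / HIP-CPU", 16.5554, 14.9517},
    {"VADD / HIP-CPU", 277.307, 272.3},
    {"SGEMM / HIP-CPU", 772.25, 656.881},
    {"N-Body / HIP-CPU 0", 263.542, 221.189},
    {"N-Body / HIP-CPU 1", 231.13, 210.983},
    {"N-Body / HIP-CPU 2", 696.486, 658.155},
    {"N-Queens / GPU - Parallel", 240.73, 196.201},
    {"N-Queens / GPU - Optimised", 227.102, 207.862},
    {"HAXPY / HAXPY", 87.6069, 71.2243},
};

// Speedup > 1 means the patched build was faster in that single run.
double speedup(const Result& r) { return r.main_mean / r.patch_mean; }
```

With these numbers SGEMM comes out at roughly 1.18x and VADD at roughly 1.02x, the latter being within the run-to-run variation noted above.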
@fodinabor very nice, thanks, and apologies for the delayed reply. I'm happy to merge this once you've had a chance to go through the comments. Cheers!
Good news :) What comments do you mean?
Merged, thank you ever so much @fodinabor!