BabelStream
Fix performance degradation of HIP dot
The workload of the dot calculation is not consistent across the different implementations. The larger the arraysize, the longer the HIP version takes to complete relative to the CUDA version.
# hip-stream -n 1500 -s $((1<<30)) | grep Dot
Dot 1376603.333 0.01248 0.01266 0.01251
# cuda-stream -n 1500 -s $((1<<30)) | grep Dot
Dot 1444860.830 0.01189 0.01199 0.01193
The HIP version currently derives 'dot_num_blocks' from arraysize. That value is used both as the kernel grid size and as the iteration count for the final reduction in the host code, so both grow with the array. The CUDA counterpart derives 'dot_num_blocks' from the number of SMs (based on the GPU specs). The CUDA approach should give more reliable results because of higher occupancy and a more reasonable host-side reduction overhead.
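To make the difference concrete, here is a minimal host-only sketch of the two sizing strategies. It is not the BabelStream code: the constants kThreadsPerBlock, kElementsPerThread and kBlocksPerSM and both function names are hypothetical, chosen only to show how the block count (and therefore the number of partial sums the host has to add up) scales in each case.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tuning constants for illustration; not taken from BabelStream.
constexpr std::size_t kThreadsPerBlock = 1024;
constexpr std::size_t kElementsPerThread = 4;
constexpr std::size_t kBlocksPerSM = 4;

// Strategy A (array-size based): one block per fixed-size chunk of the array.
// The block count, and the host-side reduction loop over the per-block
// partial sums, both grow linearly with array_size.
std::size_t blocks_from_array_size(std::size_t array_size) {
    assert(array_size > 0);
    const std::size_t per_block = kThreadsPerBlock * kElementsPerThread;
    return (array_size + per_block - 1) / per_block;  // round up
}

// Strategy B (device based): a fixed number of blocks per multiprocessor.
// The block count is independent of array_size; each block is expected to
// stride over the array and accumulate more elements instead.
std::size_t blocks_from_sm_count(std::size_t sm_count) {
    assert(sm_count > 0);
    return sm_count * kBlocksPerSM;
}
```

With these assumed constants, an arraysize of 1<<30 gives 262144 blocks (and as many host-side additions) under strategy A, while a hypothetical device with 108 multiprocessors gets 432 blocks under strategy B regardless of arraysize.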