BabelStream Fix performance degradation of HIP dot

Fix performance degradation of HIP dot

Open ddmatsu opened this issue 7 months ago • 0 comments

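The workload of the dot calculation is not consistent across implementations: as arraysize grows, the HIP version takes increasingly longer to complete than the CUDA version. In the run below (columns: function, MBytes/sec, then min, max, and average time in seconds), HIP dot reaches about 4.7% lower bandwidth than CUDA dot.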

# hip-stream -n 1500 -s $((1<<30)) | grep Dot
Dot         1376603.333 0.01248     0.01266     0.01251
# cuda-stream -n 1500 -s $((1<<30)) | grep Dot
Dot         1444860.830 0.01189     0.01199     0.01193

The HIP version currently derives 'dot_num_blocks' from arraysize. That value serves both as the kernel grid size and as the iteration count of the reduction in the host code, so both grow with the array. The CUDA counterpart derives 'dot_num_blocks' from the number of SMs (based on GPU specs), so it stays fixed for a given device. The CUDA approach should give more reliable results because of higher occupancy and a more reasonable host-side reduction overhead.
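The difference between the two strategies can be sketched on the host side in plain C++. This is only an illustration under stated assumptions, not the BabelStream code: the function names, and the parameters `tbsize`, `elems_per_lane`, and `blocks_per_sm`, are hypothetical. `emulated_dot` is a CPU stand-in for a grid-stride kernel that writes one partial sum per block. It shows that a fixed, device-based block count still covers every element, while the host only has to reduce `num_blocks` partial sums.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Size-based strategy: the block count grows linearly with the array,
// and so does the number of partial sums the host must reduce.
// tbsize and elems_per_lane are hypothetical tuning parameters.
std::size_t blocks_from_array_size(std::size_t array_size,
                                   std::size_t tbsize,
                                   std::size_t elems_per_lane) {
    assert(tbsize > 0 && elems_per_lane > 0);
    return array_size / (tbsize * elems_per_lane);
}

// Device-based strategy: the block count depends only on the GPU.
// blocks_per_sm is a hypothetical occupancy multiplier.
std::size_t blocks_from_sm_count(std::size_t sm_count,
                                 std::size_t blocks_per_sm) {
    return sm_count * blocks_per_sm;
}

// CPU emulation of a grid-stride dot kernel. A thread with global id
// g = block * tbsize + lane visits g, g + stride, g + 2 * stride, ...
// with stride = tbsize * num_blocks, so element i belongs to block
// (i / tbsize) % num_blocks. Each block accumulates one partial sum,
// and the final loop over the partial sums is the host-side reduction.
double emulated_dot(const std::vector<double>& a,
                    const std::vector<double>& b,
                    std::size_t num_blocks,
                    std::size_t tbsize) {
    assert(a.size() == b.size() && num_blocks > 0 && tbsize > 0);
    std::vector<double> partial(num_blocks, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        partial[(i / tbsize) % num_blocks] += a[i] * b[i];
    }
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

With the assumed values `tbsize = 1024` and `elems_per_lane = 4`, an array of `1 << 30` elements gives 262144 blocks under the size-based strategy. A hypothetical device with 108 SMs and `blocks_per_sm = 4` gives 432 blocks regardless of the array size.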

Jul 02 '24 02:07 ddmatsu
