Updated hip kernels

Open thomasgibson opened this issue 3 years ago • 0 comments

This PR modifies the HIP kernels and includes an optional compile-time flag to modify how many elements are processed per thread lane. A summary of modifications:

~All kernels have a template parameter elements_per_lane which determines how many elements are processed per lane.~
- ~This can be configured by setting the compile flag DWORDS_PER_LANE. The default value gives the best performance overall~.
- Only Dot and NStream make use of DWORDS_PER_LANE, while the remaining main kernels only contain minor modifications using updated syntax (e.g. hipBlockDim_x -> blockDim.x)
The restrict qualifier has been added to the kernel arguments for NStream, which helps boosts performance slightly.
The number of blocks launched for the dot kernel has been modified.
The partial sums for the dot-kernel is allocated using hipHostMalloc, which allocates in a device-visible page. Memory transfer occurs asynchronously and, as a result, requires a hipDeviceSynchronize after the kernel is called.

Happy to discuss how much of this you want upstream. For reference (before/after) numbers, here are some quick results in double precision:

Reference (baseline - develop): Ran on an MI-210 with the following arguments with an array size of 2^28 elements: -s $((2**28))

Function    MBytes/sec  Min (sec)   Max         Average     
Copy        1406620.269 0.00305     0.00322     0.00308     
Mul         1404059.445 0.00306     0.00316     0.00312     
Add         1279936.374 0.00503     0.00522     0.00515     
Triad       1272501.462 0.00506     0.00524     0.00516     
Dot         745985.464  0.00576     0.00687     0.00626

Updated kernels: Ran using the default value for DWORDS_PER_LANE (4) to determine the launch bounds for the Dot kernel:

Function    MBytes/sec  Min (sec)   Max         Average     
Copy        1406789.357 0.00305     0.00310     0.00307     
Mul         1407020.248 0.00305     0.00622     0.00315     
Add         1279350.509 0.00504     0.00522     0.00515     
Triad       1286205.887 0.00501     0.00525     0.00517     
Dot         1297750.809 0.00331     0.00344     0.00338

Aug 15 '22 13:08 thomasgibson