BabelStream
BabelStream copied to clipboard
Updated hip kernels
This PR modifies the HIP kernels and includes an optional compile-time flag to modify how many elements are processed per thread lane. A summary of modifications:
- ~All kernels have a template parameter
elements_per_lanewhich determines how many elements are processed per lane.~- ~This can be configured by setting the compile flag
DWORDS_PER_LANE. The default value gives the best performance overall~. - Only
DotandNStreammake use ofDWORDS_PER_LANE, while the remaining main kernels only contain minor modifications using updated syntax (e.g.hipBlockDim_x->blockDim.x)
- ~This can be configured by setting the compile flag
- The
restrictqualifier has been added to the kernel arguments for NStream, which helps boosts performance slightly. - The number of blocks launched for the dot kernel has been modified.
- The partial sums for the dot-kernel is allocated using
hipHostMalloc, which allocates in a device-visible page. Memory transfer occurs asynchronously and, as a result, requires ahipDeviceSynchronizeafter the kernel is called.
Happy to discuss how much of this you want upstream. For reference (before/after) numbers, here are some quick results in double precision:
Reference (baseline - develop): Ran on an MI-210 with the following arguments with an array size of 2^28 elements: -s $((2**28))
Function MBytes/sec Min (sec) Max Average
Copy 1406620.269 0.00305 0.00322 0.00308
Mul 1404059.445 0.00306 0.00316 0.00312
Add 1279936.374 0.00503 0.00522 0.00515
Triad 1272501.462 0.00506 0.00524 0.00516
Dot 745985.464 0.00576 0.00687 0.00626
Updated kernels: Ran using the default value for DWORDS_PER_LANE (4) to determine the launch bounds for the Dot kernel:
Function MBytes/sec Min (sec) Max Average
Copy 1406789.357 0.00305 0.00310 0.00307
Mul 1407020.248 0.00305 0.00622 0.00315
Add 1279350.509 0.00504 0.00522 0.00515
Triad 1286205.887 0.00501 0.00525 0.00517
Dot 1297750.809 0.00331 0.00344 0.00338