GPU Sum Reduction
CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as is reasonable.
An attempt at an optimized GPU sum reduction, based on the "Optimizing Parallel Reduction in CUDA" talk by NVIDIA's Mark Harris.
Final Results
At the last reasonable point of optimization, the sum reduce kernel achieves about 96% of the theoretical memory bandwidth of my laptop's GPU (GeForce GTX 850M): 13.867 GB/s versus the theoretical 14.4 GB/s. Recall that reduction is constrained mainly by memory bandwidth, since the algorithm is not compute-intensive at all. Having achieved an excellent percentage of the theoretical memory bandwidth, we can reasonably stop at this point.
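For reference, an effective-bandwidth figure like the one above is obtained by dividing the bytes the kernel moves by its measured execution time. Below is a minimal sketch of that measurement using standard CUDA event timing; the function name, launch configuration, and buffers are illustrative placeholders rather than the repository's actual code, and the kernel is the one sketched under Main Strategies below.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Forward declaration of the reduction kernel sketched in "Main Strategies"
// below; the name is an illustrative placeholder.
__global__ void sumReduce(const float *g_in, float *g_out, unsigned int n);

// Hypothetical harness: times one launch and reports effective bandwidth.
void measureBandwidth(const float *d_in, float *d_out,
                      unsigned int n, unsigned int numBlocks)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    sumReduce<<<numBlocks, 1024>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // The reduction reads each of the n input floats exactly once; the
    // per-block partial sums it writes are negligible by comparison.
    double gbPerSec = (double)n * sizeof(float) / (ms * 1e-3) / 1e9;
    printf("Effective bandwidth: %.3f GB/s\n", gbPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```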
Main Strategies
- Process as much data as possible (without affecting the algorithm's correctness) in shared memory.
- Use sequential addressing so that the global-to-shared memory load is coalesced and the reduction over shared memory is free of bank conflicts (see the sketch below).
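As a rough illustration of these two strategies, here is a minimal sketch of what such a kernel can look like. It assumes 1024-thread blocks, each loading a 2048-element tile into shared memory (the tile size discussed in the notes below); the names are placeholders and this is not the repository's actual code.

```cuda
__global__ void sumReduce(const float *g_in, float *g_out, unsigned int n)
{
    __shared__ float sdata[2048];

    unsigned int tid  = threadIdx.x;         // 0..1023
    unsigned int base = blockIdx.x * 2048;   // each block owns a 2048-item tile

    // Each thread copies two elements into shared memory, so the whole tile
    // is resident in shared memory before any reduction work starts.
    // Consecutive threads read consecutive global addresses (coalesced).
    sdata[tid]        = (base + tid        < n) ? g_in[base + tid]        : 0.0f;
    sdata[tid + 1024] = (base + tid + 1024 < n) ? g_in[base + tid + 1024] : 0.0f;
    __syncthreads();

    // Sequential addressing: at each step, the active threads read a
    // contiguous block of shared memory, so consecutive threads hit
    // consecutive banks and no bank conflicts occur.
    for (unsigned int s = 1024; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; a second pass (or a host-side
    // sum) combines the per-block results.
    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}
```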
Other Notes
- This implementation performs better than Reduction #3 (sequential addressing) and even Reduction #4 (first add during load) in Mark Harris' slides, without resorting to the loop unrolling introduced beyond Reduction #4. Specifically, the shared memory efficiency (as measured in nvvp, the NVIDIA Visual Profiler) is higher in my implementation (96.1%) than in Reduction #4 (84.6%). This most likely owes to my implementation processing more data in shared memory (2048 items per block) than Reduction #4 (1024 items).
- For a reason unknown to me (for now), incorporating loop unrolling into the implementation makes the Debug build output correct values, while the Release build outputs incorrect values.
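One classic cause of exactly this Debug-works/Release-breaks symptom (an assumption on my part, not confirmed for this code) is the unrolled last warp reading shared memory without a volatile qualifier: the optimizing Release compiler may keep shared-memory values cached in registers, while the unoptimized Debug build happens to re-read memory at every step. Mark Harris' later reduction versions guard against this by taking the shared-memory pointer as volatile, roughly:

```cuda
// Warp-level unrolled reduction, as in Mark Harris' slides. The volatile
// qualifier forces every read and write to actually go through shared
// memory, so the optimizing compiler cannot reuse stale register copies.
__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
```

Note that this pattern relies on implicit warp synchronization, which is safe on the GTX 850M's Maxwell architecture but not on Volta and newer, where inserting __syncwarp() between the steps is the recommended approach.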