Extend nvbench to measure SOL for compute-bound workloads
nvbench currently supports measuring speed-of-light (SOL) for memory-bound workloads through
state.add_global_memory_reads(nbytes)
state.add_global_memory_writes(nbytes)
It would be useful to extend this concept to FLOPs, so that we can measure how close a workload is to the compute roofline.
We briefly discussed this internally and concluded that there is no good way to provide this feature for the following two reasons:
- There is no easy way to query peak flops for a given GPU architecture.
- The instruction mix may not use floating-point instructions at all, in which case the relevant peak is a different instruction issue rate.
We could work around the first issue by tabulating the peak FLOPS for the various GPU architectures. However, this does not address the second issue: different kinds of compute-bound benchmarks have different theoretical maximums, depending on which instructions they use.
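To make the workaround for the first issue concrete, here is a minimal sketch of what such a table could look like. Everything in it is hypothetical and not part of nvbench: the function names, the choice to return 0 for unknown architectures, and the table itself, whose FP32 lane counts are quoted from memory of the published per-SM figures and should be checked against the CUDA documentation. The SM count and clock would come from the device properties at runtime. It assumes one FMA (2 FLOPs) per lane per cycle.

```cpp
// Hypothetical helper, not an nvbench API: FP32 lanes per SM for a few
// compute capabilities. Values are assumptions to be verified against
// the CUDA documentation. Returns 0 for architectures not in the table.
inline int fp32_lanes_per_sm(int cc_major, int cc_minor)
{
  switch (cc_major * 10 + cc_minor)
  {
    case 60: return 64;  // Pascal (GP100)
    case 61: return 128; // Pascal (GP10x)
    case 70: return 64;  // Volta
    case 75: return 64;  // Turing
    case 80: return 64;  // Ampere (GA100)
    case 86: return 128; // Ampere (GA10x)
    case 89: return 128; // Ada
    case 90: return 128; // Hopper
    default: return 0;   // unknown: caller must handle this
  }
}

// Estimated peak FP32 rate in FLOP/s, counting one FMA as two FLOPs.
// sm_count and clock_hz are expected to come from the device properties.
// Returns 0.0 when the architecture is not in the table.
inline double peak_fp32_flops(int cc_major, int cc_minor, int sm_count, double clock_hz)
{
  const int lanes = fp32_lanes_per_sm(cc_major, cc_minor);
  return 2.0 * lanes * sm_count * clock_hz;
}
```

For example, `peak_fp32_flops(8, 0, 108, 1.41e9)` gives roughly 19.5e12 FLOP/s. The table has to be extended by hand for every new architecture, and it only covers FP32 FMA throughput, which is exactly the limitation described by the second issue.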
The only valid approach would be for users to provide the peak metric themselves for each benchmark, since only they know what the bound is (FLOPs, integer operations, shuffles, reductions, atomic throughput, etc.).
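A minimal sketch of what a user-provided bound could look like, independent of nvbench. The `compute_bound` and `compute_sol` types and the `compute_speed_of_light` function are hypothetical names invented for illustration; how such a value would be registered on `state` and reported is left open.

```cpp
#include <string>

// Hypothetical description of a user-supplied compute bound.
struct compute_bound
{
  std::string unit;        // what is counted, e.g. "FLOP", "IntOp", "Shuffle"
  double ops_per_launch;   // operations performed by one kernel launch
  double peak_ops_per_sec; // peak rate the user considers the roofline
};

// Result of comparing a measured time against the user-supplied bound.
struct compute_sol
{
  double achieved_ops_per_sec;
  double fraction_of_peak; // 1.0 means the kernel runs at the stated peak
};

// Computes the achieved rate and its fraction of the user-provided peak.
// Returns zeros if the measured time or the peak is not positive.
inline compute_sol compute_speed_of_light(const compute_bound &bound, double seconds)
{
  if (seconds <= 0.0 || bound.peak_ops_per_sec <= 0.0)
  {
    return {0.0, 0.0};
  }
  const double achieved = bound.ops_per_launch / seconds;
  return {achieved, achieved / bound.peak_ops_per_sec};
}
```

With this shape, a kernel that performs 2e9 operations in 1 ms against a stated peak of 1e13 operations per second would report about 20% of peak, regardless of whether the unit is FLOPs, integer operations, or atomics.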
These metrics could also be related to CUPTI metrics.