Constant and local memory version of this benchmark
This benchmark only tests the case where data is stored in global memory. Would it make sense to benchmark in the same manner using constant and local memory instead? If so, having such versions would be great!
I think I ran such experiments in the past. The picture gets blurry, though, because local memory access instructions (shared memory in CUDA terminology) affect the execution throughput of compute instructions. Of course, this again depends on the underlying architecture. Overall, I think it is an interesting idea for experimentation, but it's a different story from what this benchmark is intended to be.
So, @ekondis, you think this benchmark is not applicable to testing local and constant memory?
It certainly could be, but the results you'd get from it would be more difficult to interpret.
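To illustrate why the interpretation gets harder, here is a toy model in Python. It is only a sketch, not a measurement and not how the benchmark itself works: every rate in it is a made-up number, and the idea that local memory instructions compete with compute instructions for the same issue slots is an assumption that may or may not hold on a given architecture.

```python
# Toy model only: all rates are hypothetical numbers, not measured values.

FLOP_RATE = 100.0   # hypothetical compute ops per time unit
MEM_RATE = 10.0     # hypothetical global memory accesses per time unit
ISSUE_RATE = 100.0  # hypothetical shared instruction issue slots per time unit
MEM_OPS = 1000      # memory accesses performed in every run


def time_overlapped(flops, mem_ops):
    """Global memory case: compute and memory traffic are assumed to overlap
    fully, so whichever takes longer determines the run time."""
    return max(flops / FLOP_RATE, mem_ops / MEM_RATE)


def time_shared_issue(flops, mem_ops):
    """Local memory case (assumption): compute and memory instructions
    compete for the same issue slots, so their costs add up."""
    return (flops + mem_ops) / ISSUE_RATE


def sweep(intensities):
    """Return (flops per access, global throughput, local throughput) rows."""
    rows = []
    for flops_per_access in intensities:
        flops = flops_per_access * MEM_OPS
        t_global = time_overlapped(flops, MEM_OPS)
        t_local = time_shared_issue(flops, MEM_OPS)
        rows.append((flops_per_access, flops / t_global, flops / t_local))
    return rows


if __name__ == "__main__":
    for intensity, tput_global, tput_local in sweep([1, 2, 5, 10, 20, 50]):
        print(f"{intensity:>3} flops/access  "
              f"global: {tput_global:6.1f}  local: {tput_local:6.1f}")
```

Under these assumptions, the global memory curve has a clear knee where it switches from memory bound to compute bound, while the local memory curve only approaches the compute peak gradually. Without a knee, it is harder to read a distinct memory limit and a distinct compute limit off the results.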