Low performance: rocSPARSE on [Radeon Instinct MI25] vs. SpMV CSR on CPU
Hello,
With rocSPARSE on a large dataset, even after running 10 iterations on the GPU, I see that GPU performance is still slightly lower than a naive CSR implementation on the CPU.
The results are below. The fastest of the 10 GPU iterations is still slower than the CPU. The time measured on the GPU includes the HIP memcpy plus the compute. You can find the code here: https://github.com/siddhart92/SPMV/blob/master/rocm_spmv/spmv_csr_largeDataset.cpp
Can someone tell me why that is? I was hoping to see better results on the GPU. Am I missing something here?
To run, you can simply git clone the repo above and do:
make : to compile
make run : to run on the board (assuming you have ROCm installed)
...
Coordinate sizes: M = 23052 N = 23052 NZ = 583096
Matrix Size :583096
rowsize : 23052
rowptr size : 23053
CPU SPMV duration : 0.608 ms
Device: Vega 10 [Radeon Instinct MI25]
GPU SPMV duration : 58.947 ms
GPU SPMV duration : 0.805 ms
GPU SPMV duration : 0.764 ms
GPU SPMV duration : 0.762 ms
GPU SPMV duration : 0.833 ms
GPU SPMV duration : 0.754 ms
GPU SPMV duration : 0.758 ms
GPU SPMV duration : 0.944 ms
GPU SPMV duration : 0.699 ms
GPU SPMV duration : 0.702 ms
Comparing top function with testbench
Computed 0 incorrect results
Thanks, Sid
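For readers unfamiliar with the baseline being compared against, a naive CSR SpMV on the CPU can be sketched roughly as below. This is only an illustration, not the code from the linked repository: the function name `spmv_csr`, the use of `std::vector`, and the `float`/`int` element types are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive CSR sparse matrix-vector product: y = A * x.
// row_ptr has (rows + 1) entries; col_idx and vals have one entry per nonzero.
// Element types (int indices, float values) are assumed for illustration.
std::vector<float> spmv_csr(const std::vector<int>& row_ptr,
                            const std::vector<int>& col_idx,
                            const std::vector<float>& vals,
                            const std::vector<float>& x) {
    assert(!row_ptr.empty());
    assert(col_idx.size() == vals.size());
    const std::size_t rows = row_ptr.size() - 1;
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        float sum = 0.0f;
        // Nonzeros of row i live in the half-open range [row_ptr[i], row_ptr[i + 1]).
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += vals[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
    return y;
}
```

For the matrix in the log above, `row_ptr` would hold 23053 entries and `col_idx`/`vals` 583096 entries each, so a single pass touches only a few megabytes of data.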
I'm not familiar with SpMV, but I tried profiling the example with rocprof. While the reported "GPU SPMV duration" is around 0.8 ms in my tests, rocprof shows an average kernel time of about 0.013 ms.
On a P100 GPU, the "GPU SPMV duration" of the CUDA version is around 1 ms in my tests, while nvprof shows an average kernel time of about 0.028 ms.
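To make the gap between the two measurements concrete, the share of the timed region spent inside the kernel can be worked out from the figures quoted above. The helpers below are a hypothetical sketch (the names `kernel_share` and `non_kernel_ms` are made up for illustration) and use only the approximate numbers reported in this thread.

```cpp
#include <cassert>

// Fraction of the end-to-end measured time that is spent inside the kernel.
double kernel_share(double total_ms, double kernel_ms) {
    assert(total_ms > 0.0 && kernel_ms >= 0.0);
    return kernel_ms / total_ms;
}

// Time in the measured region that is NOT kernel execution
// (for example memory copies, launch overhead, synchronization).
double non_kernel_ms(double total_ms, double kernel_ms) {
    assert(total_ms >= kernel_ms);
    return total_ms - kernel_ms;
}
```

With the approximate values above, `kernel_share(0.8, 0.013)` is about 0.016 on the MI25 and `kernel_share(1.0, 0.028)` is 0.028 on the P100. In both cases the kernel accounts for under 3% of the reported duration, which suggests the timed region is dominated by something other than the SpMV kernel itself, consistent with the original post's note that the measurement includes the memcpy.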
@siddhart92 Apologies for the lack of response. Can you please test with latest ROCm 6.0.2 (HIP 6.0.32831)? If resolved, please close ticket. Thanks!