hipblasDgemm not getting close to peak
What is the expected behavior
- I would expect a dgemm with sufficiently large inputs to achieve close to the 47.9 TFLOP/s peak.
What actually happens
- Using the code provided in https://github.com/JorgeG94/calum_performance_tool, rocm/5.4.0 reaches 38 TFLOP/s.
How to reproduce
- The repo https://github.com/JorgeG94/calum_performance_tool has a README, but basically:

hipcc -L/opt/rocm-5.4.3/lib -lhipblas --offload-arch=gfx90a performance.cpp
./a.out 36000 14400 36000 10 T T
Environment
| Hardware | description |
|---|---|
| GPU | MI250x |
| CPU | AMD Optimized 3rd Gen EPYC |

| Software | version |
|---|---|
| ROCM | v5.4.0 |
I've tried larger sizes, and at some point the code just breaks, without ever exceeding 40 TFLOP/s.
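When narrowing down where larger sizes start to fail, it may help to check the memory footprint and element counts for a given problem size. The following is a rough sketch, not a diagnosis: the helper names are hypothetical, it assumes the three matrices are stored densely as doubles with no padding, and whether the benchmark code uses 32-bit indices is an assumption that would need to be verified against the source.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Bytes needed to hold A (m x k), B (k x n) and C (m x n) as doubles,
// assuming dense storage with no padding.
std::uint64_t dgemm_bytes(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return sizeof(double) * (m * k + k * n + m * n);
}

// True if the largest of the three matrices has more elements than a
// signed 32-bit integer can index.
bool exceeds_int32_index(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    const std::uint64_t largest = std::max({m * k, k * n, m * n});
    return largest >
           static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}
```

For the sizes in the reproducer, `dgemm_bytes(36000, 14400, 36000)` is 18,662,400,000 bytes (about 18.7 GB), and `exceeds_int32_index` returns false. Comparing these values for the failing sizes against the memory available on the device could help distinguish an out-of-memory condition from an indexing problem.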
Hi @JorgeG94, thanks for opening this issue.
hipBLAS is just a wrapper library for rocBLAS/cuBLAS backends. rocBLAS then uses the Tensile library for calls to gemm. Since you're looking for better performance in dgemm, I think it will be best if I transfer this issue to the Tensile library where they can hopefully help you out. Performance tuning done there will be realized in rocBLAS and hipBLAS w/ AMD backend.
Thanks, Daine
I will check this on my side. Does the performance drop happen only with this size? Have you checked other sizes and/or orientations?
@JorgeG94 Can you please test with the latest ROCm 6.1.2? If your issue is resolved, please close the ticket. Thanks!