
hipblasdgemm not getting close to peak

Open JorgeG94 opened this issue 2 years ago • 4 comments

What is the expected behavior

  • I would expect a dgemm on sizeable inputs to get close to the 47.9 TFLOP/s peak

What actually happens

  • Using the code provided in https://github.com/JorgeG94/calum_performance_tool, rocm/5.4.0 reaches only 38 TFLOP/s

How to reproduce

  • The code in the repo https://github.com/JorgeG94/calum_performance_tool has the README, but basically:
      hipcc -L/opt/rocm-5.4.3/lib -lhipblas --offload-arch=gfx90a performance.cpp
      ./a.out 36000 14400 36000 10 T T

Environment

Hardware description
  GPU: MI250x
  CPU: AMD Optimized 3rd Gen EPYC
Software version
  ROCm: v5.4.0

JorgeG94, Apr 06 '23 00:04

I've tried larger sizes, and at some point the code just fails, without ever getting past the 40 TFLOP/s mark

JorgeG94, Apr 06 '23 00:04

Hi @JorgeG94, thanks for opening this issue.

hipBLAS is just a wrapper library for rocBLAS/cuBLAS backends. rocBLAS then uses the Tensile library for calls to gemm. Since you're looking for better performance in dgemm, I think it will be best if I transfer this issue to the Tensile library where they can hopefully help you out. Performance tuning done there will be realized in rocBLAS and hipBLAS w/ AMD backend.

Thanks, Daine

daineAMD, Apr 06 '23 15:04

I will check this on my side. Does the performance drop happen only with this size? Have you checked other sizes and/or orientations?

nakajee, Apr 11 '23 17:04

@JorgeG94 Can you please test with the latest ROCm 6.1.2? If your issue is resolved, please close the ticket. Thanks!

ppanchad-amd, Jul 15 '24 20:07