Tensile hipblasdgemm not getting close to peak

What is the expected behavior

I would expect a dgemm on sizeable inputs to achieve close to the 47.9 TFLOP/s peak.

What actually happens

Using the code provided in https://github.com/JorgeG94/calum_performance_tool, ROCm 5.4.0 reaches only 38 TFLOP/s.

How to reproduce

The repo https://github.com/JorgeG94/calum_performance_tool has a README, but basically:

hipcc -L/opt/rocm-5.4.3/lib -lhipblas --offload-arch=gfx90a performance.cpp
./a.out 36000 14400 36000 10 T T

Environment

Hardware	description
GPU	MI250X
CPU	AMD Optimized 3rd Gen EPYC

Software	version
ROCm	v5.4.0

Apr 06 '23 00:04 JorgeG94

I've tried larger sizes, and at some point the code just fails, without ever getting past the 40 TFLOP/s barrier.

Apr 06 '23 00:04 JorgeG94

Hi @JorgeG94, thanks for opening this issue.

hipBLAS is just a wrapper library for rocBLAS/cuBLAS backends. rocBLAS then uses the Tensile library for calls to gemm. Since you're looking for better performance in dgemm, I think it will be best if I transfer this issue to the Tensile library where they can hopefully help you out. Performance tuning done there will be realized in rocBLAS and hipBLAS w/ AMD backend.

Thanks, Daine

Apr 06 '23 15:04 daineAMD

I will check this on my side. Does the performance drop happen only with this size? Have you checked other sizes and/or orientations?

Apr 11 '23 17:04 nakajee

@JorgeG94 Can you please test with the latest ROCm 6.1.2? If your issue is resolved, please close the ticket. Thanks!

Jul 15 '24 20:07 ppanchad-amd