16K FP8 GEMM performance is low with ./hipblaslt-bench on MI300X
Software:
1. Docker: rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
2. hipBLASLt: develop branch, commit 402603df (Feb 28, 2025)

Hardware: MI300X, clock fixed to 1005 MHz
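For reference, a minimal sketch of how such a container is typically launched with GPU access; the device-passthrough flags below are the usual ROCm Docker options, not details taken from the report above:

```sh
# Hedged sketch: standard ROCm device-passthrough flags, assumed rather than
# copied from the original setup.
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --ipc=host --shm-size 16G \
    --security-opt seccomp=unconfined \
    rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
```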
When I run ./hipblaslt-bench -r f8_r -m 16384 -n 16384 -k 16384 from hipBLASLt/build/release/clients/staging, the reported performance is 28 TFLOPS.
When I run ./hipblaslt-bench -r f8_r -m 8192 -n 8192 -k 8192, the reported performance is 907 TFLOPS.
Why is the 16K FP8 GEMM performance so low, and how can I improve it?
Hi @Alice1069. An internal ticket has been created to investigate your issue. Thanks!
Hi @Alice1069, this is not entirely unexpected; benchmark performance will vary with problem size. The discrepancy here does seem large, though, so I'll try to reproduce it. There are additional parameters that can be passed to improve/optimize the benchmark performance. I'm not aware of any public guides that document them, but I'll try to find a few key ones for you.
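For illustration only, a hedged sketch of the kind of extra options often passed to hipblaslt-bench when tuning a single problem size; the exact flag names and defaults differ between hipBLASLt versions, so verify them against ./hipblaslt-bench --help in your build:

```sh
# Hedged sketch, not an official tuning guide: check flag names with
# `./hipblaslt-bench --help` for your specific build.

# Increase warm-up and timed iterations so clocks settle and the reported
# number is stable:
./hipblaslt-bench -r f8_r -m 16384 -n 16384 -k 16384 \
    --cold_iters 50 --iters 100

# Sweep all available solutions instead of the default heuristic pick, so the
# best kernel found for this problem size is reported:
./hipblaslt-bench -r f8_r -m 16384 -n 16384 -k 16384 --algo_method all
```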
Hi schung, is there any method that could make the 16K GEMM quicker?
This issue has been migrated to: https://github.com/ROCm/rocm-libraries/issues/320
Closing the issue in this repo. Please refer to the migrated issue for updates.