16K FP8 GEMM performance is low with ./hipblaslt-bench on MI300X
Software:
1. Docker: rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
2. hipBLASLt: develop branch, commit 402603df (Feb 28, 2025)

Hardware: MI300X, clock fixed to 1005 MHz
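For reference, a minimal sketch of how such a container is typically launched with GPU access; the device-passthrough flags below are the usual ROCm Docker options, not details taken from the report above:

```sh
# Hedged sketch: standard ROCm device-passthrough flags, assumed rather than
# copied from the original setup.
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --ipc=host --shm-size 16G \
    --security-opt seccomp=unconfined \
    rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
```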
When I run ./hipblaslt-bench -r f8_r -m 16384 -n 16384 -k 16384 from hipBLASLt/build/release/clients/staging, the reported performance is 28 TFLOPS.
When I run ./hipblaslt-bench -r f8_r -m 8192 -n 8192 -k 8192, the reported performance is 907 TFLOPS.
Why is the 16K FP8 GEMM performance so low, and how can I improve it?
Hi @Alice1069. An internal ticket has been created to investigate your issue. Thanks!
Hi @Alice1069, this is not entirely unexpected; benchmark performance will vary with problem size. The discrepancy here does seem large, though, so I'll try to reproduce it. There are additional parameters that can be passed to improve/optimize the benchmark performance. I'm not aware of any public guides that document them, but I'll try to find a few key ones for you.
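For illustration only, a hedged sketch of the kind of extra options often passed to hipblaslt-bench when tuning a single problem size; the exact flag names and defaults differ between hipBLASLt versions, so verify them against ./hipblaslt-bench --help in your build:

```sh
# Hedged sketch, not an official tuning guide: check flag names with
# `./hipblaslt-bench --help` for your specific build.

# Increase warm-up and timed iterations so clocks settle and the reported
# number is stable:
./hipblaslt-bench -r f8_r -m 16384 -n 16384 -k 16384 \
    --cold_iters 50 --iters 100

# Sweep all available solutions instead of the default heuristic pick, so the
# best kernel found for this problem size is reported:
./hipblaslt-bench -r f8_r -m 16384 -n 16384 -k 16384 --algo_method all
```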
Hi schung, is there any method that could make the 16K GEMM quicker?
This issue has been migrated to: https://github.com/ROCm/rocm-libraries/issues/320
Closing the issue in this repo. Please refer to the migrated issue for updates.