
16K FP8 GEMM shows low performance with ./hipblaslt-bench on MI300X

Open Alice1069 opened this issue 10 months ago • 3 comments

Software:
1. Docker: rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
2. hipBLASLt: develop branch, commit 402603df (Feb 28, 2025)

Hardware: MI300X, clock fixed at 1005 MHz

When I run

/hipBLASLt/build/release/clients/staging# ./hipblaslt-bench -r f8_r -m 16384 -n 16384 -k 16384

the performance is 28 TFLOPS.

When I run

hipBLASLt/build/release/clients/staging# ./hipblaslt-bench -r f8_r -m 8192 -n 8192 -k 8192

the performance is 907 TFLOPS.

Why is the 16K FP8 GEMM performance so low? How can I improve it?

Alice1069 avatar Mar 04 '25 04:03 Alice1069
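
To put the two reported figures side by side, here is a minimal sketch that converts them into implied per-GEMM run times. It assumes the benchmark counts 2*m*n*k floating-point operations per GEMM (the usual convention, not confirmed in this thread) and that the reported numbers are sustained TFLOPS; the function names are illustrative only and not part of hipBLASLt.

```python
# Rough sanity check of the two reported results.
# Assumption: one GEMM costs 2*m*n*k floating-point operations.

def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for one m x n x k GEMM (multiply + add)."""
    return 2 * m * n * k


def implied_time_seconds(m: int, n: int, k: int, tflops: float) -> float:
    """Run time implied by a reported throughput in TFLOPS."""
    return gemm_flops(m, n, k) / (tflops * 1e12)


# (square problem size, reported TFLOPS) taken from the post above
reported = [(16384, 28.0), (8192, 907.0)]

for size, tflops in reported:
    seconds = implied_time_seconds(size, size, size, tflops)
    print(f"{size}^3 GEMM at {tflops} TFLOPS -> {seconds * 1e3:.3f} ms per GEMM")

work_ratio = gemm_flops(16384, 16384, 16384) / gemm_flops(8192, 8192, 8192)
time_ratio = (implied_time_seconds(16384, 16384, 16384, 28.0)
              / implied_time_seconds(8192, 8192, 8192, 907.0))
print(f"work ratio: {work_ratio:.0f}x, implied time ratio: {time_ratio:.0f}x")
```

Under these assumptions the 16K problem has 8 times the work of the 8K problem but takes a few hundred times longer, which is why the gap looks like more than ordinary size-dependent variation.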

Hi @Alice1069. Internal ticket has been created to investigate your issue. Thanks!

ppanchad-amd avatar Mar 04 '25 16:03 ppanchad-amd

Hi @Alice1069, this is not entirely unexpected; benchmark performance varies with problem size. The discrepancy here does seem large, though, so I'll try to reproduce it. There are other parameters you can pass to improve the benchmark performance, although I'm not aware of any public guides that detail them; I'll try to find a few key ones for you.

schung-amd avatar Mar 04 '25 20:03 schung-amd

Hi schung, is there any method that could make the 16K bgemm faster?

Alice1069 avatar Mar 10 '25 03:03 Alice1069

This issue has been migrated to: https://github.com/ROCm/rocm-libraries/issues/320

Closing the issue in this repo. Please refer to the migrated issue for updates.

idass1990 avatar Jun 20 '25 21:06 idass1990