llama.cpp
Bug: Flash Attention performs worse under ROCm
What happened?
Turning on flash attention degrades performance under ROCm (at least it does with a 7900 XTX). Using batched-bench, the degradation is fairly minor at a batch size of 1:
prompt processing: 461 -> 434
token generation: 24.26 -> 23.84
However, when running multiple requests in parallel, the effect is much more pronounced. At a batch size of 16 the difference is massive:
prompt processing: 678 -> 375
token generation: 169.65 -> 86.87
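For reference, the comparison can be reproduced with llama-batched-bench along these lines; the model path, context size, and token counts below are placeholders, not the exact settings used for the numbers above:

```
# Sketch only: model path, context size and token counts are assumptions.
# Run once without and once with -fa (flash attention) and compare the t/s columns.
./llama-batched-bench -m model.gguf -ngl 99 -c 8192 \
    -npp 512 -ntg 128 -npl 1,16
./llama-batched-bench -m model.gguf -ngl 99 -c 8192 \
    -npp 512 -ntg 128 -npl 1,16 -fa
```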
Flash attention is required in order to use quantization for the KV cache, but the performance hit is drastic. Can this be fixed?
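For context, this is roughly how a quantized KV cache would be enabled; the model and the q8_0 cache types here are just an example, not the exact setup:

```
# Sketch only: model and cache types are an example.
# -ctk/-ctv set the K/V cache types; as noted above, this relies on -fa being enabled.
./llama-batched-bench -m model.gguf -ngl 99 -c 8192 \
    -npp 512 -ntg 128 -npl 1,16 -fa -ctk q8_0 -ctv q8_0
```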
Name and Version
build: 4123 (2eb76b2a) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response