
Bug: Flash Attention performs worse under ROCm

Mushoz opened this issue 3 months ago • 46 comments

What happened?

Turning on flash attention degrades performance under ROCm (at least on a 7900 XTX). Using batched-bench, the degradation is quite minor at a batch size of 1:

prompt processing: 461 -> 434 t/s
token generation: 24.26 -> 23.84 t/s

However, when running multiple requests concurrently, the effect is much more pronounced. At a batch size of 16 the difference is massive:

prompt processing: 678 -> 375 t/s
token generation: 169.65 -> 86.87 t/s
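
For reference, a comparison along these lines can be reproduced with llama-batched-bench. A sketch, assuming a build in the current directory and a model at ./model.gguf (the model path, context size, and offload count are placeholders):

```sh
# Baseline: flash attention off (the default)
# -c 16384 leaves room for 16 parallel sequences of 512 prompt + 128 generated tokens
./llama-batched-bench -m ./model.gguf -ngl 99 -c 16384 \
    -npp 512 -ntg 128 -npl 1,2,4,8,16

# Same run with flash attention enabled
./llama-batched-bench -m ./model.gguf -ngl 99 -c 16384 \
    -npp 512 -ntg 128 -npl 1,2,4,8,16 -fa
```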

Flash Attention is required in order to use KV-cache quantization, but the performance hit is drastic. Can this be fixed?
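
(For context on the KV-cache point: llama.cpp rejects a quantized V cache unless flash attention is enabled, which is why -fa cannot simply be dropped here. A hypothetical server invocation, with the model path as a placeholder:)

```sh
# Quantized KV cache; the quantized V cache is rejected without -fa
./llama-server -m ./model.gguf -ngl 99 -fa -ctk q8_0 -ctv q8_0
```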

Name and Version

build: 4123 (2eb76b2a) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Mushoz • Nov 20 '24 19:11