
flash attention does not yield speed gains on llama example

jorgeantonio21 opened this issue 10 months ago · 1 comment

After trying the llama example with either the cuda or flash-attn features, I realized the generation times are quite similar. I would expect flash attention to give a significant improvement in token generation speed (at least, according to the authors of the paper).

I am running these tests on an NVIDIA RTX 4090, with the commands:

cargo run --release --features flash-attn --example llama -- --use-flash-attn --sample-len 1000

with

47.79993663930067 token/s

and

cargo run --release --features cuda --example llama -- --sample-len 1000

with

47.11094751627298 token/s

I have also experimented with the falcon 7b example and noticed the same lack of speed improvement.
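
For context, the only code difference between the two llama runs should be the attention path. Roughly, this is what I understand the dispatch to look like (a sketch, assuming candle_flash_attn's flash_attn helper; names, shapes, and layouts here are illustrative, not the exact example code):

    use candle_core::{Result, Tensor};

    // With the flash-attn feature, attention goes through the fused kernel.
    #[cfg(feature = "flash-attn")]
    fn attention(q: &Tensor, k: &Tensor, v: &Tensor, softmax_scale: f32) -> Result<Tensor> {
        // Fused flash-attention kernel (f16/bf16 inputs, causal masking enabled).
        candle_flash_attn::flash_attn(q, k, v, softmax_scale, true)
    }

    // Without it, the example falls back to plain attention, which materializes
    // the full attention matrix before the softmax.
    #[cfg(not(feature = "flash-attn"))]
    fn attention(q: &Tensor, k: &Tensor, v: &Tensor, softmax_scale: f32) -> Result<Tensor> {
        let att = (q.matmul(&k.t()?)? * softmax_scale as f64)?;
        let att = candle_nn::ops::softmax_last_dim(&att)?;
        att.matmul(v)
    }
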

jorgeantonio21 · Apr 15 '24

Sorry for the late reply. I think one issue with the measurement here is that we're including the time to generate the first token, which is bounded by the model being loaded asynchronously. I've tweaked it in #2106 so that we only measure the time spent after the first token. With this change I get 68.5 token/s without flash-attn and 74.4 token/s with flash-attn (on an H100), so not a massive speedup but it does seem to have some effect.
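
The measurement change is roughly the following (a sketch of the idea, not the exact diff in #2106; next_token here is a hypothetical closure standing in for one decode step):

    use std::time::Instant;

    // Exclude the first token from the throughput measurement so that the one-off
    // setup cost (async model loading, kernel warm-up) is not counted.
    fn report_throughput(
        mut next_token: impl FnMut() -> anyhow::Result<u32>,
        sample_len: usize,
    ) -> anyhow::Result<()> {
        let mut start_gen: Option<Instant> = None;
        let mut timed_tokens = 0usize;
        for index in 0..sample_len {
            let _token = next_token()?;
            if index == 0 {
                // The first token pays the setup cost; start timing only after it.
                start_gen = Some(Instant::now());
            } else {
                timed_tokens += 1;
            }
        }
        if let Some(start) = start_gen {
            let dt = start.elapsed().as_secs_f64();
            println!("{:.2} token/s", timed_tokens as f64 / dt);
        }
        Ok(())
    }
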

LaurentMazare · Apr 22 '24