Paged Attention
Just found a recent blog post https://vllm.ai/ and repo https://github.com/vllm-project/vllm that implement paged attention. I tested it out and it provides massive throughput and memory-efficiency improvements.
Can we implement something like this? The paper isn't out yet, but shouldn't Rust be very well suited to this in theory, given its memory safety guarantees?
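From the blog post, the core idea appears to be that each sequence's KV cache is allocated in fixed-size blocks through a per-sequence block table, rather than as one large contiguous buffer. A minimal Rust sketch of that bookkeeping (all names and types here are made up to illustrate the idea, not vLLM's actual code):

```rust
// Minimal sketch of the paged KV cache idea: the cache is carved into
// fixed-size physical blocks, and each sequence owns a block table mapping
// its logical token positions to physical blocks, much like OS page tables.

const BLOCK_SIZE: usize = 16; // tokens per block; a tunable parameter

struct PagedKvCache {
    free_blocks: Vec<usize>,       // indices of unused physical blocks
    block_tables: Vec<Vec<usize>>, // per-sequence logical -> physical mapping
    seq_lens: Vec<usize>,          // tokens written so far, per sequence
}

impl PagedKvCache {
    fn new(num_blocks: usize) -> Self {
        Self {
            free_blocks: (0..num_blocks).collect(),
            block_tables: Vec::new(),
            seq_lens: Vec::new(),
        }
    }

    /// Register a new sequence and return its id.
    fn add_sequence(&mut self) -> usize {
        self.block_tables.push(Vec::new());
        self.seq_lens.push(0);
        self.block_tables.len() - 1
    }

    /// Reserve a slot for one more token of `seq`, allocating a fresh block
    /// only when the current one is full, so waste is at most one partially
    /// filled block per sequence. Returns the (block, offset) for the K/V.
    fn append_token(&mut self, seq: usize) -> Option<(usize, usize)> {
        let offset = self.seq_lens[seq] % BLOCK_SIZE;
        if offset == 0 {
            let block = self.free_blocks.pop()?; // out of blocks: caller must evict or wait
            self.block_tables[seq].push(block);
        }
        self.seq_lens[seq] += 1;
        Some((*self.block_tables[seq].last().unwrap(), offset))
    }
}

fn main() {
    let mut cache = PagedKvCache::new(8);
    let seq = cache.add_sequence();
    for _ in 0..20 {
        cache.append_token(seq).expect("enough blocks for this demo");
    }
    // 20 tokens with BLOCK_SIZE = 16 -> exactly 2 physical blocks
    println!("blocks used by seq {}: {:?}", seq, cache.block_tables[seq]);
}
```

Because fragmentation stays this low, many more sequences fit in the same GPU memory at once, which is where the batching and throughput gains come from.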
Does it have any benefit for CPU-only inference, given that host memory is already paged?
@vikigenius could you please share your benchmarks of vLLM vs llama.cpp on GPU? That would give us some insight into the potential speedup.
@okpatil4u I don't have benchmarks against llama.cpp. I primarily noticed the speedup between the PyTorch implementations with and without paged attention, and there is no reason to think an algorithmic change like that wouldn't translate across languages.
We tested it on NVIDIA A100 GPUs and got a significant speedup. I will try to get the numbers soon, once we have access to them again.
@okpatil4u got the numbers now. Not a rigorous benchmark, but should still hold up since the gains are so significant.
With a 40 GB A100 GPU:

- Inference on a vicuna-13B model without paged attention: 20 tokens/sec
- Inference on a vicuna-13B model with paged attention: 190 tokens/sec
So the speedup is almost 10x. Admittedly this is a bit skewed: our workload uses the same initial prompt prefix across a batch of requests, so there is likely good reuse of the KV cache, which PagedAttention helps with.
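To illustrate what I mean by prefix reuse, here is a hypothetical sketch (made-up names again, not vLLM's code) of how full blocks holding a shared prompt prefix can be mapped into every sequence's block table and reference-counted, so the prefix KV is stored once for the whole batch:

```rust
// Hypothetical sketch of prefix sharing under a paged KV cache: sequences
// that start from the same prompt point their block tables at the same
// physical blocks, tracked by a reference count.

use std::collections::HashMap;

const BLOCK_SIZE: usize = 16;

struct SharedPrefixCache {
    refcounts: HashMap<usize, usize>, // physical block -> sequences using it
    next_block: usize,
}

impl SharedPrefixCache {
    fn new() -> Self {
        Self { refcounts: HashMap::new(), next_block: 0 }
    }

    /// Allocate the physical blocks holding a prompt prefix of `prefix_tokens` tokens.
    fn alloc_prefix(&mut self, prefix_tokens: usize) -> Vec<usize> {
        let num_blocks = (prefix_tokens + BLOCK_SIZE - 1) / BLOCK_SIZE;
        let blocks: Vec<usize> = (self.next_block..self.next_block + num_blocks).collect();
        self.next_block += num_blocks;
        for &b in &blocks {
            self.refcounts.insert(b, 0);
        }
        blocks
    }

    /// Start a new sequence that reuses the prefix: its block table simply
    /// points at the same physical blocks, bumping their refcounts.
    fn fork_from_prefix(&mut self, prefix_blocks: &[usize]) -> Vec<usize> {
        for b in prefix_blocks {
            *self.refcounts.get_mut(b).unwrap() += 1;
        }
        prefix_blocks.to_vec()
    }
}

fn main() {
    let mut cache = SharedPrefixCache::new();
    let prefix = cache.alloc_prefix(100); // 100-token shared prompt -> 7 blocks
    let batch: Vec<Vec<usize>> = (0..32).map(|_| cache.fork_from_prefix(&prefix)).collect();
    // 32 sequences, but the prefix keys/values occupy only 7 physical blocks.
    println!("sequences: {}, physical prefix blocks: {}", batch.len(), prefix.len());
    // Each sequence then appends its own blocks for newly generated tokens;
    // writing into a shared block would require copy-on-write.
}
```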
Wow, this is amazing. Thanks for posting.
But are you sure vicuna-13B under llama.cpp benchmarks at 50 ms/token (20 tokens/sec) on a 40 GB A100? I would expect it to be a bit faster.
Well, as I mentioned before, we don't actually use llama.cpp at work on our A100s, so my benchmark numbers compare PyTorch implementations with and without paged attention.
It is possible that at this point llama.cpp itself is a bit better than the PyTorch implementation, which might explain the discrepancy.
But given how big the gain is, I would expect that porting PagedAttention to llama.cpp would show similar gains there as well.
The discussion here might be relevant: https://github.com/ggerganov/llama.cpp/issues/1955, although it seems many people there are misunderstanding how the paging works.
It should be hugely beneficial for any batched inference workload, even on a single GPU.
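To make the distinction concrete: the "paging" is an indirection inside the KV cache layout that the attention computation follows, not OS-level paging of host memory. A rough scalar CPU sketch of that lookup (single head, softmax and scaling omitted, all layout assumptions are mine):

```rust
// Sketch of attention over a paged KV cache: keys live in fixed-size
// physical blocks, and the kernel walks the sequence's block table instead
// of assuming one contiguous key buffer per sequence.

const BLOCK_SIZE: usize = 16;
const HEAD_DIM: usize = 4;

/// Raw dot-product attention scores for one query against a sequence whose
/// keys are scattered across `key_blocks` according to `block_table`.
/// (Softmax and 1/sqrt(d) scaling are omitted to keep the indexing visible.)
fn paged_attention_scores(
    query: &[f32; HEAD_DIM],
    key_blocks: &[[[f32; HEAD_DIM]; BLOCK_SIZE]], // all physical key blocks
    block_table: &[usize],                        // logical block -> physical block
    seq_len: usize,
) -> Vec<f32> {
    let mut scores = Vec::with_capacity(seq_len);
    for pos in 0..seq_len {
        let physical = block_table[pos / BLOCK_SIZE]; // indirection through the table
        let key = &key_blocks[physical][pos % BLOCK_SIZE];
        let dot: f32 = query.iter().zip(key.iter()).map(|(q, k)| q * k).sum();
        scores.push(dot);
    }
    scores
}

fn main() {
    // Two physical blocks used by a 20-token sequence in the order [1, 0].
    let key_blocks = vec![
        [[0.5f32; HEAD_DIM]; BLOCK_SIZE],
        [[1.0f32; HEAD_DIM]; BLOCK_SIZE],
    ];
    let block_table = [1usize, 0];
    let query = [1.0f32; HEAD_DIM];
    let scores = paged_attention_scores(&query, &key_blocks, &block_table, 20);
    println!("{} scores, first = {}, last = {}", scores.len(), scores[0], scores[19]);
}
```

The win comes from not having to reserve a max-length contiguous KV region per sequence, which is orthogonal to whether the OS pages host memory.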
Unfortunately, we are likely beholden to what upstream GGML supports, as this would be applied at that layer of execution. This is something we could potentially implement with #312, but even then we'd need to work with wonnx to support this.
I'll leave this issue open for now, but I don't think we'll see much movement here from our end, sorry :/
Hello, I recently saw ggerganov's PR https://github.com/ggerganov/llama.cpp/pull/3228, where he implemented parallel decoding for multiple sequences. Are there any plans to support this feature? It would basically provide a mechanism for batch inference 🤔 Thanks!
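Purely as a hypothetical sketch of what I mean (none of these types exist in `llm` today, and the forward pass below is a dummy stand-in for evaluating one GGML graph over the whole batch):

```rust
// Hypothetical parallel-decoding driver: each step batches the next token of
// every live sequence, and finished sequences drop out of the batch.

struct Sequence {
    id: usize,
    tokens: Vec<u32>,
    done: bool,
}

/// Stand-in for a model forward pass over (sequence id, last token) pairs;
/// a real implementation would evaluate one batched graph here.
fn forward_batch(batch: &[(usize, u32)]) -> Vec<u32> {
    batch.iter().map(|(_, tok)| tok.wrapping_add(1)).collect() // dummy "next token"
}

fn decode_parallel(sequences: &mut [Sequence], max_len: usize) {
    while sequences.iter().any(|s| !s.done) {
        // Gather one pending token per unfinished sequence into a single batch.
        let batch: Vec<(usize, u32)> = sequences
            .iter()
            .filter(|s| !s.done)
            .map(|s| (s.id, *s.tokens.last().unwrap()))
            .collect();
        let next = forward_batch(&batch);
        for ((id, _), tok) in batch.iter().zip(next) {
            let seq = sequences.iter_mut().find(|s| s.id == *id).unwrap();
            seq.tokens.push(tok);
            if seq.tokens.len() >= max_len {
                seq.done = true; // a real sampler would also stop on EOS
            }
        }
    }
}

fn main() {
    let mut seqs: Vec<Sequence> = (0..4)
        .map(|id| Sequence { id, tokens: vec![1], done: false })
        .collect();
    decode_parallel(&mut seqs, 8);
    for s in &seqs {
        println!("seq {} -> {:?}", s.id, s.tokens);
    }
}
```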
Hi, that would be nice to have! I'm not sure if we'll get around to it any time soon as it'll require updating our GGML version and setting up all of the required structures, but I'll see what can be done once we get there.