
Investigate PagedAttention KV-cache memory management for faster inference

Azeirah opened this issue 1 year ago · 15 comments

New research just came out on using a technique inspired by kernel virtual memory and pages to manage the KV cache.

Results? Way faster inference!

https://vllm.ai/

They claim up to 24x the throughput (measured in requests handled per second) compared to Hugging Face's Transformers library.

[Figure: throughput comparison chart from the vLLM announcement]

How?

Inference is bottlenecked by memory, most notably by the KV cache. They say the KV cache's most notable features are:

  • It is very large.
  • It is dynamic: its size depends on the sequence length, which is variable. Existing systems waste 60-80% of this memory due to fragmentation and over-reservation.

PagedAttention is an alternative approach to managing the KV cache, inspired by virtual memory, pages, and blocks. By allocating the space dynamically with this approach, only about 4% of memory is wasted, instead of the aforementioned 60-80%.
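For intuition, here is a rough sketch of the bookkeeping involved (hypothetical names and types, not vLLM's or llama.cpp's actual code): the cache is carved into fixed-size blocks, and each sequence only keeps a small table mapping its logical positions to physical blocks, so its KV data no longer has to be contiguous.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of block-based ("paged") KV-cache management.
// The cache is carved into fixed-size blocks; a sequence only holds a table of
// block indices, so its KV data does not have to be contiguous in memory.
struct PagedKVCache {
    static constexpr int kBlockTokens = 16;   // tokens per block (a typical choice)
    std::vector<int32_t> free_blocks;         // indices of unused physical blocks

    explicit PagedKVCache(int n_blocks) {
        for (int i = n_blocks - 1; i >= 0; --i) free_blocks.push_back(i);
    }

    // Per-sequence "page table": logical block -> physical block index.
    struct Sequence {
        std::vector<int32_t> block_table;
        int n_tokens = 0;
    };

    // Append one token's KV entry; a new physical block is grabbed only when the
    // current one is full, so at most one partially filled block is wasted.
    bool append_token(Sequence & seq) {
        if (seq.n_tokens % kBlockTokens == 0) {
            if (free_blocks.empty()) return false;   // out of blocks: evict or swap
            seq.block_table.push_back(free_blocks.back());
            free_blocks.pop_back();
        }
        seq.n_tokens++;
        return true;
    }

    // When a sequence finishes, its blocks go straight back to the pool.
    void release(Sequence & seq) {
        for (int32_t b : seq.block_table) free_blocks.push_back(b);
        seq.block_table.clear();
        seq.n_tokens = 0;
    }
};
```

With fixed-size blocks, the only waste is the unfilled tail of each sequence's last block, which is roughly where the ~4% figure comes from.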

For further details, refer to their website and GitHub.

Azeirah avatar Jun 20 '23 22:06 Azeirah

llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

JohannesGaessler avatar Jun 21 '23 09:06 JohannesGaessler

I assume it would be useful if we want to host the models and have an interface like chat.openai.com?

nivibilla avatar Jun 21 '23 09:06 nivibilla

Yes, for enterprise use where you have one server generating responses for many users in parallel, the optimization would be useful.

JohannesGaessler avatar Jun 21 '23 10:06 JohannesGaessler

> llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

Oh, I wasn't aware this was exclusively for a client-server application; that explains why they measure performance in requests/sec 🥲

Azeirah avatar Jun 21 '23 11:06 Azeirah

This optimization is still applicable, as it can reduce the VRAM usage of the KV tensors.

howard0su avatar Jun 21 '23 13:06 howard0su

If we do end up building this for server use (and I think that would be a good idea), then this paging system would be very useful.

nivibilla avatar Jun 21 '23 13:06 nivibilla

I read through the blog and the code. It turns out that PagedAttention is a way to manage memory so that the compute kernel doesn't require the KV cache to be contiguous. This makes it possible to have one prompt's KV block appended to by multiple outputs' KV blocks, like the following:

Prompt KV Block ------ Output 1 KV Block
                ------ Output 2 KV Block
                ------ ...

This is super helpful if your prompt is long and you need to generate multiple outputs. It is purely an engineering trick; the change is mainly in how we manage the KV cache in VRAM. If we are using the CPU, this is even simpler to implement (roughly as simple as a list vs. a vector).
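To make that concrete, here is a rough sketch of how several outputs could share the prompt's physical blocks via reference counting, copying a block only when it needs to be written to (hypothetical names, not the vLLM implementation):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of sharing prompt KV blocks across several outputs
// via reference counting (copy-on-write); names are hypothetical.
struct SharedBlockPool {
    std::vector<int32_t> refcount;    // reference count per physical block
    std::vector<int32_t> free_list;   // unused physical block indices

    explicit SharedBlockPool(int n_blocks) : refcount(n_blocks, 0) {
        for (int i = n_blocks - 1; i >= 0; --i) free_list.push_back(i);
    }

    // Grab a fresh physical block (returns -1 if the pool is exhausted).
    int32_t alloc() {
        if (free_list.empty()) return -1;
        int32_t b = free_list.back();
        free_list.pop_back();
        refcount[b] = 1;
        return b;
    }

    // Fork a sequence: a new output reuses the prompt's blocks, so the prompt's
    // KV is stored only once no matter how many outputs we sample from it.
    std::vector<int32_t> fork(const std::vector<int32_t> & prompt_blocks) {
        for (int32_t b : prompt_blocks) refcount[b]++;
        return prompt_blocks;   // same physical blocks, new per-sequence table
    }

    // Before an output writes into a shared block, give it a private copy.
    int32_t make_writable(int32_t b) {
        if (refcount[b] == 1) return b;   // sole owner: write in place
        refcount[b]--;
        return alloc();                   // caller copies the block's KV data over
    }

    // Drop a sequence's reference to a block; recycle it when nobody uses it.
    void release(int32_t b) {
        if (--refcount[b] == 0) free_list.push_back(b);
    }
};
```

Freeing a block is then just decrementing its refcount and returning it to the pool once the count reaches zero.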

howard0su avatar Jun 21 '23 14:06 howard0su

We allocate all the KV memory required for the maximum context length on startup in one block, so we shouldn't have any fragmentation either.
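For a sense of scale, a back-of-the-envelope sizing of that single allocation (illustrative parameters for a LLaMA-7B-like model with an fp16 KV cache, not the actual allocation code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    // Example parameters, chosen for illustration only.
    const size_t n_layer = 32;
    const size_t n_ctx   = 2048;   // maximum context length
    const size_t n_embd  = 4096;
    const size_t kv_bytes = 2 /* K and V */ * n_layer * n_ctx * n_embd * sizeof(uint16_t);
    printf("KV cache: %.2f GiB, allocated once at startup\n",
           kv_bytes / (1024.0 * 1024.0 * 1024.0));   // prints ~1.00 GiB
    return 0;
}
```

Since the buffer is sized for the worst case up front, nothing fragments afterwards; the over-reservation that PagedAttention targets only becomes a problem when many sequences of varying length have to share the same memory.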

slaren avatar Jun 21 '23 16:06 slaren

@JohannesGaessler Is serving multiple users concurrently or batch inference on the roadmap of llama.cpp?

randxie avatar Jun 25 '23 10:06 randxie

I don't have any plans for it because I don't care about commercial use, but I can't speak for the other devs.

JohannesGaessler avatar Jun 25 '23 10:06 JohannesGaessler

Should it not be on the list?

Today we are talking about chatbots; in 6 months or so, people will start looking for autonomous agents.

Would it not make sense to build a system that can process multiple requests simultaneously and efficiently?


okpatil4u avatar Jun 25 '23 10:06 okpatil4u

Yeah, I think first we need to solve batch inference. It's implemented in babyllama, but I haven't tried to port it over to the main llama yet.

nivibilla avatar Jun 25 '23 11:06 nivibilla

I'm not really concerned with what other people want to use llama.cpp for. I'm implementing things that are useful for me personally first and foremost. And I don't see how I would benefit from batched inference since I only run llama.cpp for myself on my own hardware.

JohannesGaessler avatar Jun 25 '23 12:06 JohannesGaessler

That's fair. Batch inference would be useful for me to use this at scale, for example if I want to do sentiment analysis on a large dataset or summarisation at scale.

nivibilla avatar Jun 25 '23 13:06 nivibilla

And in this case, having a server to handle multiple users at the same time would be useful.

nivibilla avatar Jun 25 '23 13:06 nivibilla

I have a comparison of the PyTorch implementations with and without paging on a single GPU, and the gains are significant. My use case is primarily batch inference, so I am not sure about model serving.

With a 40 GB A100 GPU:

Inference on a vicuna-13B model without paged attention produces 20 tokens/sec.
Inference on a vicuna-13B model with paged attention produces 190 tokens/sec.

So the speedup is almost 10x. Obviously this is a bit skewed, because our workload uses the same initial prompt prefix in a batch-inference setting, so there may be good reuse of the KV cache, which is helped by PagedAttention.

vikigenius avatar Jun 28 '23 16:06 vikigenius

Thanks Vikash. You mentioned in another thread that there may be some misunderstanding in this thread about how vLLM works. Could you please explain what you meant by that?

Also, there have been other comments about its effect on performance on CPU, GPU, and Mac M1/M2 GPUs. Could you or someone else shed some light on that?

okpatil4u avatar Jun 29 '23 08:06 okpatil4u

From what I understand, this isn't so much about the multi-user/client-server use case as it is about batched inference, which does seem to be valid even for single-user/local apps, depending on the application.

keeganmccallum avatar Jul 04 '23 04:07 keeganmccallum

Wouldn’t the decreased memory requirement (they state that they cut memory usage by 55%) also be a benefit when running inference on smaller devices like phones and laptops?

chrfalch avatar Jul 07 '23 13:07 chrfalch

Should be useful if there's a large context.

FNsi avatar Jul 09 '23 16:07 FNsi

Both vLLM and lmDeploy have high-throughput batch-inference modes with various tricks. The problem is that they don't support GGUF.

How complex would it be to port those tricks (KV cache paging, dynamic batching) to llama.cpp?

viktor-ferenczi avatar Sep 10 '23 01:09 viktor-ferenczi

#2813 - we still need to implement the non-tricky version.

Related, there's #2969, which should also be about a 50% reduction in memory use.

KerfuffleV2 avatar Sep 11 '23 04:09 KerfuffleV2

#2813 only covers "same prompt, multiple output", not "multiple prompt, multiple output".

kiratp avatar Sep 11 '23 06:09 kiratp

I'd like to voice my support for this. Over at the KoboldAI community we have had requests for multi-user support, and it would also help our Horde platform, which currently benefits from TGI's speed but gets poorer output from TGI than from llama.cpp.

Having llama.cpp be fast for these use cases means multiple communities would begin using it as a general-purpose inference server, which would be a cool addition to the project (once multiple requests can be queued up).

henk717 avatar Sep 13 '23 01:09 henk717

I think this feature is important for making llama.cpp usage spread even further.

tikikun avatar Sep 13 '23 04:09 tikikun

Which one would be easier: porting the performance/throughput tricks into llama.cpp, or porting GGUF support into vLLM?

(lmDeploy is out of the picture, since they don't want to support GGUF; they closed the feature request/suggestion ticket because they want to concentrate on other things.)

viktor-ferenczi avatar Sep 14 '23 20:09 viktor-ferenczi

IMO, implementing the same idea inside llama.cpp is much better. Currently, vLLM leverages a PyTorch extension to customize the attention kernel. One benefit of llama.cpp is that it gets rid of PyTorch and is friendlier to edge deployment.

We could consider porting the kernels from vLLM into llama.cpp. It would probably require a certain amount of refactoring in llama.cpp, though.
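As a rough illustration of what such a kernel has to do differently (a simplified single-head CPU sketch with illustrative names, not vLLM's actual CUDA kernel): the dot products and softmax are unchanged, but K and V are fetched through the block table instead of from one contiguous buffer.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Simplified single-head attention over a paged KV cache (illustrative only).
// k_blocks/v_blocks: physical storage, block_size tokens per block, head_dim floats per token.
// block_table: logical block index -> physical block index for this sequence.
std::vector<float> paged_attention(
        const std::vector<float> & q,                      // query, [head_dim]
        const std::vector<std::vector<float>> & k_blocks,  // physical K blocks
        const std::vector<std::vector<float>> & v_blocks,  // physical V blocks
        const std::vector<int32_t> & block_table,
        int n_tokens, int block_size, int head_dim) {
    std::vector<float> scores(n_tokens);
    float max_s = -1e30f;
    for (int t = 0; t < n_tokens; ++t) {
        // Gather K for token t through the block table instead of assuming contiguity.
        const float * k = k_blocks[block_table[t / block_size]].data()
                        + (t % block_size) * head_dim;
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d) s += q[d] * k[d];
        s /= std::sqrt((float) head_dim);
        scores[t] = s;
        max_s = std::max(max_s, s);
    }
    // Softmax over the scores, then a weighted sum of V fetched the same way.
    float denom = 0.0f;
    for (int t = 0; t < n_tokens; ++t) { scores[t] = std::exp(scores[t] - max_s); denom += scores[t]; }
    std::vector<float> out(head_dim, 0.0f);
    for (int t = 0; t < n_tokens; ++t) {
        const float * v = v_blocks[block_table[t / block_size]].data()
                        + (t % block_size) * head_dim;
        for (int d = 0; d < head_dim; ++d) out[d] += (scores[t] / denom) * v[d];
    }
    return out;
}
```

The extra indirection is cheap on CPU; the engineering work in vLLM is mostly in making the same gather efficient inside a fused GPU kernel.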

randxie avatar Sep 14 '23 21:09 randxie

#3479

bobqianic avatar Oct 04 '23 22:10 bobqianic

Where is the KVCacheManager implemented? Is it on the GPU or on the host (CPU)?

naik-amey avatar Nov 08 '23 18:11 naik-amey

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 10 '24 01:04 github-actions[bot]