Long time until generation starts when using big context

Open CyberTimon opened this issue 2 years ago • 10 comments

When I just say something like "Hello, who are you?", I get around 200 ms/token and generation starts almost instantly. On the other hand, when I paste a small text (e.g. search results from the DuckDuckGo API), I have to wait about a minute and then it generates, but quite slowly. Is this normal behaviour?

My CPU is a Ryzen 7 6800H with 32 GB of DDR5 RAM. I'm running Vicuna 7B. I paste the search result context via the Python bindings.

CyberTimon avatar Apr 09 '23 15:04 CyberTimon

Yes, the self-attention mechanism has quadratic cost: each input token must be related to every other input token. This means that as the context grows, inference speed drops. Sub-quadratic attention models are a current hot topic of research.
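
As a rough illustration of where the quadratic cost comes from (a NumPy sketch of single-head attention, not llama.cpp's actual kernels): the score matrix has one entry per pair of tokens, so doubling the context quadruples the attention work per layer.

```python
# NumPy sketch of single-head self-attention; illustration only.
import numpy as np

def attention(x):
    n, d = x.shape                       # n context tokens, d-dimensional states
    q, k, v = x, x, x                    # toy projections (identity) for brevity
    scores = q @ k.T / np.sqrt(d)        # (n, n): every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                   # (n, d) output

for n in (128, 512, 2048):
    x = np.random.randn(n, 64).astype(np.float32)
    attention(x)                         # work grows with n**2, not n
```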

alankila avatar Apr 09 '23 16:04 alankila

Are there any recommendations for --batch_size?

E.g. I observe that --batch_size 8 will quickly input tokens from a file sequentially, while --batch_size 32 will enter more tokens from the prompt in a single iteration, but there seems to be a large pause between entering tokens.

I was wondering if there is a number for batch_size at which point you get diminishing returns in terms of speed/efficiency.

Is it better to use a smaller batch_size or a larger one - or does it not really make a difference?
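
For reference on what the flag changes (a hypothetical sketch, not llama.cpp's real evaluation API; `eval_chunk` is a stand-in): the prompt is fed to the model in chunks of --batch_size tokens, so a larger batch means fewer but heavier forward passes, while the total amount of work stays roughly the same.

```python
# Hypothetical sketch of batched prompt ingestion; eval_chunk() is a stand-in
# for a real forward pass, not an actual llama.cpp API.
import time

def eval_chunk(chunk):
    time.sleep(0.001 * len(chunk))       # pretend the cost grows with chunk size

def ingest_prompt(tokens, batch_size):
    for start in range(0, len(tokens), batch_size):
        eval_chunk(tokens[start:start + batch_size])
    # batch_size 8  -> many small steps, frequent visible progress
    # batch_size 32 -> fewer steps, each with a longer pause

ingest_prompt(list(range(512)), batch_size=32)
```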

Baaleos avatar Apr 09 '23 16:04 Baaleos

So there is no way at the moment to speed up this process?

CyberTimon avatar Apr 09 '23 16:04 CyberTimon

So there is no way at the moment to speed up this process?

OpenBLAS (or preferably Intel's oneAPI library) was said to speed that up. I haven't tested it, though.
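
For what it's worth, the reason a BLAS backend helps is that prompt ingestion is dominated by large matrix multiplications, which tuned BLAS kernels handle far better than naive loops. A rough comparison in Python (NumPy dispatches to whatever BLAS it is linked against, so this only illustrates the gap; it is not a llama.cpp benchmark):

```python
# Illustration only: naive Python matmul vs. NumPy's BLAS-backed matmul.
import time
import numpy as np

n = 128
a = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n, n).astype(np.float32)

t0 = time.perf_counter()
naive = [[sum(a[i, k] * b[k, j] for k in range(n)) for j in range(n)] for i in range(n)]
t1 = time.perf_counter()
fast = a @ b                             # dispatched to the linked BLAS library
t2 = time.perf_counter()

print(f"naive: {t1 - t0:.3f}s  BLAS-backed: {t2 - t1:.5f}s")
```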

cmp-nct avatar Apr 10 '23 00:04 cmp-nct

Could anyone explain why this is not an issue on GPU?

GPU models load a 2k context like it's nothing. The difference is hard to comprehend.

Ryzen 5950X CPU on a 13B model: Output generated in 131.94 seconds (0.15 tokens/s, 20 tokens, context 1481). That's basically 126 seconds of self-attention over 1481 tokens.

RTX 3050 GPU on a 7B model: Output generated in 6.51 seconds (0.31 tokens/s, 2 tokens, context 1811). That's basically 6 seconds for an even larger context (but a smaller model; I can't effectively run 13B).

Anyway, there's no way around it: it's at least a 15-20x difference.

Can we outsource some kind of calculations to the GPU? Like make it prepare data in some way that the CPU struggles with?

Priestru avatar Apr 10 '23 11:04 Priestru

Can we outsource some kind of calculations to the GPU? Like make it prepare data in some way that the CPU struggles with?

I was experimenting with hipBLAS/cuBLAS before, but the limitation is that all the weights would need to be copied to GPU memory. Maybe I should try again; the BLAS code has had a lot of improvements in the meantime...

SlyEcho avatar Apr 10 '23 16:04 SlyEcho

Textgen webui has an option to offload layers to the CPU. What I noticed is that generation speed is absurdly low (like 0.14 t/s), but self-attention is quick:

Output generated in 77.97 seconds (0.12 tokens/s, 9 tokens, context 2030)

I measured evaluation by counting seconds, and evaluating the 2k-token context took 14 seconds. In my case the model didn't fit into VRAM at all.

Priestru avatar Apr 10 '23 20:04 Priestru

I just spent a couple of hours benchmarking and remembered this issue. There are two major factors that currently play a role:

  1. The entire prompt needs to be processed, which takes a while (as discussed here).
  2. If you use the memory-mapped version (mmap()), the current code loads the model during the first inference. This can take a long time depending on your disk and model size. I have a pull request up (https://github.com/ggerganov/llama.cpp/pull/869), not yet fully tested for all cases (or on Linux), which shifts that waiting time back to the loading phase. You won't gain any performance that way, but without it the mmap loading is mixed into the first inference (see the sketch below).
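
To illustrate the mmap point (a generic Python sketch, not llama.cpp's actual loader; `model.bin` is just a placeholder path): mapping a file is almost free because nothing is read yet, and the pages are only faulted in from disk the first time they are touched, which with mmap-based loading happens during the first evaluation.

```python
# Generic sketch of lazy page loading with mmap; not llama.cpp's loader.
# "model.bin" is a placeholder for any large file on disk.
import mmap
import time

with open("model.bin", "rb") as f:
    t0 = time.perf_counter()
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    t1 = time.perf_counter()                                 # mapping: near-instant
    checksum = sum(mm[i] for i in range(0, len(mm), 4096))   # touch every page once
    t2 = time.perf_counter()                                 # first touch pays the disk cost
    mm.close()

print(f"map: {t1 - t0:.4f}s  first touch: {t2 - t1:.2f}s")
```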

cmp-nct avatar Apr 10 '23 23:04 cmp-nct

I apologize if this sounds like a naive question, but I am unsure about how the context is evaluated by the model. From my observation, in interactive mode, the model seems to gradually process all previous input before responding to the latest prompt. I'm curious to know whether it's feasible to provide the model with the previous prompt beforehand so that while the user is typing, the AI can begin preparing its response as part of the ongoing interaction.
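
In principle, something like that is possible: the known conversation history can be evaluated ahead of time so that only the newly typed tokens need a forward pass once the user hits enter. A hypothetical sketch of the idea only; `tokenize` and `evaluate_into_cache` are stand-ins, not real llama.cpp or binding APIs:

```python
# Hypothetical sketch: pre-evaluate the known history while the user is typing,
# so only the new tokens remain to be processed afterwards.
import threading

def tokenize(text):
    return text.split()                      # placeholder tokenizer

def evaluate_into_cache(tokens, kv_cache):
    kv_cache.extend(tokens)                  # placeholder for a real forward pass

kv_cache = []
history = "Previous conversation turns ..."
warmup = threading.Thread(target=evaluate_into_cache,
                          args=(tokenize(history), kv_cache))
warmup.start()                               # runs while the user is still typing

user_text = input("> ")                      # user finishes typing
warmup.join()                                # history is already in the cache
evaluate_into_cache(tokenize(user_text), kv_cache)  # only the new tokens are left
```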

Priestru avatar Apr 11 '23 06:04 Priestru

I heard in an issue thread a while ago that it's not supposed to behave like this; it should be able to process context pretty quickly.

Is it possible to fix this? Or is it just how CPUs work, meaning it will be this slow forever?

Bloob-beep avatar Apr 12 '23 07:04 Bloob-beep

Closing, as this is just how CPUs work.

CyberTimon avatar May 21 '23 14:05 CyberTimon