llama.cpp
Initial prompt tokens should be loaded instantly and not require inference
Expected Behavior
Input prompt tokens should load instantly, without having to run inference through the model. The first inference computation should start with the first token after the prompt.
Current Behavior
I might be misunderstanding something, but it seems that in the llama.cpp implementation all the tokens from the input prompt are fed through the model sequentially (in 8-token batches) before any inference of new tokens can take place. This results in a large delay before getting any response from the model.
One of the big benefits of a transformer model, versus an RNN, is that the entire token context window can be ingested and attended to all at once. But llama.cpp seems to be behaving like an RNN, where each prompt token has to be fed in sequentially first, and the output logits ignored, until finally inference can begin.
Am I just misunderstanding something here?
To show it semi-graphically, a transformer should be able to ingest this on first run:
[I, like, to, eat, -, -, -, -, -, -] -> apples (inferred)
But llama.cpp seems to require:
[I, -, -, -, -, -, -, -, -, -] -> (logits ignored, startup batch)
[I, like, -, -, -, -, -, -, -, -] -> (logits ignored, startup batch)
[I, like, to, -, -, -, -, -, -, -] -> (logits ignored, startup batch)
[I, like, to, eat, -, -, -, -, -, -] -> apples (finally we get to the inference / logits we care about)
The prompt is already ingested in parallel, and you can specify the batch size with -b; however, the benefits of doing that on CPU are not as big as on GPU. There is discussion about this in #229.
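To make the shape of that loop concrete, here is a minimal sketch of batched prompt ingestion. This is not the actual llama.cpp code; `eval_batch` is a made-up stand-in for the model forward pass, and only the batching structure (what -b controls) mirrors the real thing.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the model forward pass: consumes a batch of
// tokens at positions [n_past, n_past + batch.size()) and updates the
// internal KV cache as a side effect. Not the real llama.cpp API.
void eval_batch(const std::vector<int> & batch, int n_past) {
    std::printf("eval %zu tokens at past=%d\n", batch.size(), n_past);
}

int main() {
    const std::vector<int> prompt = {1, 15, 42, 7, 99};  // token ids of the prompt
    const int n_batch = 8;                               // what -b controls

    // Prompt ingestion: the whole prompt still has to go through the model
    // once to fill the KV cache, but it is fed n_batch tokens at a time and
    // each batch is computed in parallel across its tokens.
    int n_past = 0;
    while (n_past < (int) prompt.size()) {
        const int n = std::min(n_batch, (int) prompt.size() - n_past);
        std::vector<int> batch(prompt.begin() + n_past, prompt.begin() + n_past + n);
        eval_batch(batch, n_past);  // logits are only needed for the last token
        n_past += n;
    }
    // Only now does sampling of new tokens begin.
    return 0;
}
```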
Yeah, that discussion didn't seem to be getting to the (maybe) root of it, so I thought I'd post here.
My question is, why should the prompt need ingesting at all, regardless of how it's done with batching, parallelism, etc.? Can't the model logic just look up the embedding / position vectors of the prompt tokens, feed those to the model, and start predicting the next token right from the start?
You seem to be thinking that a transformer is a function `F(tokens[0..=n-1]) -> probs[n]`. It isn't. It's a function `F(tokens[0..=n-1], probs[0..=n-1]) -> probs[n]`:
You need the output probabilities of the network for all prior tokens to predict the next token. So you have to incrementally feed in tokens to build up the context. See "output" as an input on the bottom right of this:
(Transformer model architecture diagram)[^1]
You can cheat a little and calculate batches mostly in parallel, but this doesn't help the total work that much. It mainly just helps latency when you have a bunch of free cores, which nicely describes GPUs and TPUs, but not so much CPUs. (This is one of the major advantages over RNNs: you can do a lot more computation in parallel than a typical RNN.)
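A tiny single-head attention step in plain C++ (made-up sizes and values, not llama.cpp code) shows where the dependence on earlier tokens comes from: the new token's query has to be scored against the keys and values of every position that came before it, which is exactly what prompt ingestion produces.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Attention for ONE new token, assuming the keys and values of all previous
// positions are already in the cache. Without k_cache/v_cache entries for the
// prompt positions, the new token has nothing to attend to.
// (Illustrative sketch only; real models use many heads and layers.)
std::vector<float> attend_one(const std::vector<float> & q,                 // query of the new token
                              const std::vector<std::vector<float>> & k_cache,
                              const std::vector<std::vector<float>> & v_cache) {
    const size_t d = q.size();
    std::vector<float> scores(k_cache.size());
    float max_s = -1e30f;
    for (size_t t = 0; t < k_cache.size(); ++t) {           // score against every cached key
        float s = 0.0f;
        for (size_t i = 0; i < d; ++i) s += q[i] * k_cache[t][i];
        scores[t] = s / std::sqrt((float) d);
        max_s = std::max(max_s, scores[t]);
    }
    float sum = 0.0f;
    for (float & s : scores) { s = std::exp(s - max_s); sum += s; }  // softmax
    std::vector<float> out(d, 0.0f);
    for (size_t t = 0; t < k_cache.size(); ++t)              // weighted sum of cached values
        for (size_t i = 0; i < d; ++i)
            out[i] += (scores[t] / sum) * v_cache[t][i];
    return out;
}

int main() {
    // Pretend the prompt had 3 tokens whose K/V were computed during ingestion.
    std::vector<std::vector<float>> k_cache = {{1, 0}, {0, 1}, {1, 1}};
    std::vector<std::vector<float>> v_cache = {{2, 0}, {0, 2}, {1, 1}};
    const auto out = attend_one({1, 1}, k_cache, v_cache);
    std::printf("attended output: %.3f %.3f\n", out[0], out[1]);
    return 0;
}
```

During prompt ingestion the queries of all prompt positions can be processed in one parallel pass, which is where the latency win on GPUs comes from, but the keys and values still have to be computed and stored for every prompt token.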
The above being said, in theory you can cache intermediate state, so you could share prefixes of prompts for example.
[^1]: From Wikipedia.
Thanks. So help me clarify: aren't the outputs of the Linear and Softmax layers (top right of the diagram) vectors of length n_vocab, so 32,000 in LLaMA's case? And aren't those randomly sampled to pick the "actual" token to generate? I thought the inputs (and prior outputs) are shorter vectors of length n_embd (4096 for LLaMA 7B), and that the actual token sampled, not the probs, is fed back into the model after looking it up in the embedding table and adding the positional encoding (bottom right of the diagram). Otherwise the model wouldn't have any knowledge of the actual words picked, right? It would just see a blurry probability space and might add words that don't fit the earlier ones.
Appreciate the clarifications, I suspect I'm missing something. Perhaps it's that the q/k/v vectors need to be calculated and cached for each earlier timestep to allow the later ones to infer.
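For what it's worth, here is a toy version of the sampling step described above, with made-up sizes and values (an 8-token vocabulary instead of 32,000): the Linear + Softmax output has length n_vocab, one token id is sampled from it, and that id, not the probability vector, is what gets embedded and fed back in.

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int n_vocab = 8;  // 32,000 for LLaMA; tiny here for illustration
    std::vector<float> logits = {0.1f, 2.0f, 0.3f, 1.5f, 0.0f, 0.2f, 0.4f, 0.1f};

    // softmax over the vocabulary -> probability of each token
    std::vector<float> probs(n_vocab);
    float sum = 0.0f;
    for (int i = 0; i < n_vocab; ++i) { probs[i] = std::exp(logits[i]); sum += probs[i]; }
    for (float & p : probs) p /= sum;

    // sample one concrete token id from that distribution
    std::mt19937 rng(42);
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    const int token = dist(rng);

    // the sampled id selects a row of the embedding table (n_vocab x n_embd);
    // that embedding (plus position info) is what goes back into the model
    const int n_embd = 4;  // 4096 for LLaMA 7B
    std::vector<std::vector<float>> embedding_table(n_vocab, std::vector<float>(n_embd, 0.01f));
    const std::vector<float> & next_input = embedding_table[token];

    std::printf("sampled token %d, feeding back an embedding of length %zu\n",
                token, next_input.size());
    return 0;
}
```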
In that case, wouldn't it be possible to cache all probs[N] of the initial prompt after a first run, and reuse them as-is for later runs?
> In that case, wouldn't it be possible to cache all probs[N] of the initial prompt after a first run, and reuse them as-is for later runs?
Yes, that is possible, and I use a similar technique in my llamacpp-for-kobold fork. It's not a great approach, though, as even a single word added to the beginning of the initial prompt would invalidate the previously computed results.
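A rough illustration of why that invalidation happens, under the assumption that cached state is keyed by the exact token prefix it was computed from (the struct and map below are hypothetical, not anything from llama.cpp or the fork): reuse only works while the new prompt starts with an already-cached prefix, so inserting a word at the very beginning means nothing matches.

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

// Opaque blob standing in for the saved intermediate state (per-layer K/V).
struct CachedState { int n_tokens; /* ... per-layer K/V tensors would live here ... */ };

// Hypothetical prefix cache, keyed by the token prefix the state came from.
std::map<std::vector<int>, CachedState> prefix_cache;

CachedState * lookup_prefix(const std::vector<int> & prompt) {
    // Find the longest cached prefix of `prompt` (linear scan for clarity).
    CachedState * best = nullptr;
    for (auto & [prefix, state] : prefix_cache) {
        if (prefix.size() <= prompt.size() &&
            std::equal(prefix.begin(), prefix.end(), prompt.begin()) &&
            (!best || (int) prefix.size() > best->n_tokens)) {
            best = &state;
        }
    }
    return best;  // nullptr if the prompt's beginning changed: full re-ingestion needed
}

int main() {
    prefix_cache[{1, 2, 3}] = {3};          // state saved after ingesting tokens 1 2 3
    std::vector<int> prompt = {1, 2, 3, 4, 5};
    if (CachedState * s = lookup_prefix(prompt)) {
        std::printf("reusing %d cached tokens, only %zu left to ingest\n",
                    s->n_tokens, prompt.size() - s->n_tokens);
    }
    return 0;
}
```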
Closing this one - on further understanding, the key & value vectors need to be calculated & cached for each token/timestep of the prompt, and each model layer, before new tokens can be generated. So ingesting the prompt is necessary. Like some of the other commenters said, how much that can be sped up on the CPU in llama.cpp is a separate issue.
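For a sense of scale, a back-of-the-envelope calculation of that per-token, per-layer key/value cache, using LLaMA 7B shapes (n_embd = 4096 as mentioned above, n_layer = 32) and assuming f16 storage; the numbers are illustrative only.

```cpp
#include <cstdio>

int main() {
    const long n_layer = 32;        // LLaMA 7B
    const long n_embd  = 4096;      // LLaMA 7B
    const long n_ctx   = 512;       // context the cache is sized for
    const long bytes_per_elem = 2;  // f16 storage assumed

    // one key vector and one value vector per token, per layer
    const long kv_bytes = 2 /* K and V */ * n_layer * n_ctx * n_embd * bytes_per_elem;
    std::printf("KV cache: %.1f MiB for %ld tokens\n", kv_bytes / (1024.0 * 1024.0), n_ctx);
    return 0;
}
```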