Make the permanent prompt permanent
Expected Behavior
The first n_keep tokens (params.prompt, e.g. alpaca.txt) are always part of the context and do not need to be recalculated.
Current Behavior
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
embd_inp is the tokenized params.prompt (e.g. alpaca.txt)
params.n_keep = (int)embd_inp.size();
n_keep is the size of the permanent prompt (e.g. alpaca.txt)
n_past = params.n_keep;
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());
n_past += embd.size();
embd now holds a number of tokens from last_n_tokens plus the original embd, but it no longer contains the permanent prompt (e.g. alpaca.txt). n_past = size of embd + n_keep (the size of the permanent prompt). But in the context, the n_keep tokens before embd are NOT the permanent prompt; the permanent prompt is all the way at the beginning of last_n_tokens.
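The arithmetic of that swap can be checked with plain integers. The sketch below mirrors the quoted logic in a standalone function (the name `context_swap` and the placeholder token values are mine for illustration, not part of the llama.cpp API):

```cpp
#include <cassert>
#include <vector>

// Mirror of the swap logic in llama.cpp's main loop. last_n_tokens is
// assumed to be a ring holding the last n_ctx tokens. Returns the new
// n_past; embd is extended in place with the re-fed tail of the old context.
int context_swap(int n_ctx, int n_keep, int n_past,
                 const std::vector<int>& last_n_tokens,
                 std::vector<int>& embd) {
    const int n_left = n_past - n_keep;
    n_past = n_keep;
    // re-feed the last n_left/2 tokens of the old context before the new ones
    embd.insert(embd.begin(),
                last_n_tokens.begin() + n_ctx - n_left/2 - (int) embd.size(),
                last_n_tokens.end() - (int) embd.size());
    return n_past;
}
```

With toy numbers n_ctx = 16, n_keep = 4, and n_past = 14 when three new tokens arrive, n_left is 10: n_past drops back to 4 and the five most recent old tokens (n_left/2) are re-fed ahead of the new ones, so n_past + embd.size() = 12, which fits in the context again.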
Are my statements correct?
Suggestions:
To solve that, we could:
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size() - n_keep, last_n_tokens.end() - embd.size());
embd.insert(embd.begin(), last_n_tokens.begin(), last_n_tokens.begin() + n_keep);
Now we have: permanent prompt (e.g. alpaca.txt) + the old context we kept + the original embd.
Is this right?
Problem: this would recompute the permanent prompt (e.g. alpaca.txt) every time the context reaches its maximum size. Why is this a problem? I run a model where the permanent prompt is 1000 tokens (a multi-shot prompt) and the questions are 250 tokens, so recomputing the permanent prompt every time is painful. Question: how do we save the computation of the permanent prompt and bring it back when the context is full?
No.
embd only contains the new tokens to be evaluated; the kept tokens from the beginning do not need to be evaluated again. That is the whole idea of this performance feature.
The LLaMA model doesn't need to see the tokens themselves; the only necessary parameter is n_past, which as you can see always includes n_keep. The model gets the past token data from the KV cache.
If there is something I would improve in the code, it is to keep around a representation of the exact context the model has at the moment. That way n_keep could be derived simply as the length of the initial common run of tokens shared by the new text and the old.
EDIT: I should also mention that last_n_tokens is kind of special in that it remembers all tokens, even if the context is truncated, but it is not used for evaluation, only for sampling.
last_n_tokens is not the actual context. I understand that. Is there a way to see the actual context? Is that what you would like to be able to see?
n_past is the number of tokens reused from the past tokens (i.e. the context). Are those n_past tokens counted from the beginning or from the end of the context?
I don't understand where the context is being truncated after the line if (n_past + (int) embd.size() > n_ctx). The only line of code there is embd.insert(), which ultimately adds more to the context. Where is the line that truncates the context?
Thanks a ton for your help.
I don't understand where the context is being truncated following the line if (n_past + (int) embd.size() > n_ctx)
It is the line:
n_past = params.n_keep;
That is it. That is all the model needs to know. The model will now calculate as if only n_keep tokens have been evaluated. You can see that n_past is a parameter into the evaluation function. It doesn't need the actual tokens. The state is actually stored in the KV cache.
embd contains the new tokens to be evaluated. The complicated-looking insert() adds some of the last seen tokens into it, before the new tokens from the user. Note that tokens are always appended to the end of last_n_tokens, which is why the range is calculated from the end in that way.