gpt4all
Allow custom prompt limit (n_ctx=2048)
Feature request
Currently there is a limit on the number of tokens that can be used in the prompt:
GPT-J ERROR: The prompt is 9884 tokens and the context window is 2048!
The error is produced in GPTJ::prompt(). There, the prompt n_ctx that arrives from the frontend is not used; instead the value comes from the model itself, so setting the value yourself won't really matter:
https://github.com/nomic-ai/gpt4all/blob/8204c2eb806aeab055b7a7fae4b4adc02e34ef41/gpt4all-backend/gptj.cpp#L920
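Until the limit itself can be raised, a rough client-side guard can at least avoid hitting this error. A minimal sketch, assuming roughly 4 characters per token (a crude heuristic, not the model's real tokenizer):

# Rough client-side guard: estimate the prompt's token count and trim it
# before it reaches the backend. The characters-per-token ratio is an
# assumption, not real tokenization.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW = 2048

def fits_context(prompt: str, reserved_for_output: int = 256) -> bool:
    # True if the prompt probably fits the model's context window.
    return len(prompt) // CHARS_PER_TOKEN + reserved_for_output <= CONTEXT_WINDOW

def truncate_to_context(prompt: str, reserved_for_output: int = 256) -> str:
    # Keep only the most recent part of the prompt that should fit.
    max_chars = (CONTEXT_WINDOW - reserved_for_output) * CHARS_PER_TOKEN
    return prompt[-max_chars:]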
Motivation
Being able to customise the prompt input limit could allow developers to build more complete plugins to interact with the model, using a more useful context and longer conversation history.
For example, right now it is almost impossible to build a plugin to browse the web, as you can't use a page's content (HTML) as part of the context because it can easily exceed the input limit.
Your contribution
.
This is more of a limit of the model itself: it was only trained with a context window of 2048, so exceeding that isn't really possible at the moment with the existing models.
The Mosaic models have a much bigger context window; even their base models are built to exceed smaller context windows: https://www.mosaicml.com/blog/mpt-7b
Interesting. Have you been able to use one of those models with the GPT4All library?
That's correct: Mosaic models have a context length of up to 4096 for the models that have been ported to GPT4All. However, GPT-J models are still limited to a 2048-token prompt, so using more tokens will not work well.
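In the meantime, a common workaround for the 2048-token limit is to split long inputs and process them piece by piece. A minimal map-reduce style sketch, where generate(prompt) stands in for whichever binding you use (the function name and the chunk sizing are assumptions, not a specific gpt4all API):

# Split a long text into chunks that should each fit a 2048-token window,
# summarize each chunk, then combine the partial summaries.
CHARS_PER_TOKEN = 4          # crude heuristic, not real tokenization
MAX_PROMPT_TOKENS = 1500     # leave headroom below the 2048-token window

def chunk_text(text, max_tokens=MAX_PROMPT_TOKENS):
    max_chars = max_tokens * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_long_text(text, generate):
    partial = [generate("Summarize this passage:\n\n" + chunk) for chunk in chunk_text(text)]
    return generate("Combine these partial summaries into one summary:\n\n" + "\n\n".join(partial))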
I used the mpt-7b-chat model and specified n_ctx=4096 but still got the error:
llm = GPT4All(model='../models/ggml-mpt-7b-chat.bin',
              verbose=False,
              temp=0,
              top_p=0.95,
              top_k=40,
              repeat_penalty=1.1,
              n_ctx=4096,
              callback_manager=stream_manager)
Error log:
Found model file.
mpt_model_load: loading model from '../models/ggml-mpt-7b-chat.bin' - please wait ...
mpt_model_load: n_vocab = 50432
mpt_model_load: n_ctx = 2048
mpt_model_load: n_embd = 4096
mpt_model_load: n_head = 32
mpt_model_load: n_layer = 32
mpt_model_load: alibi_bias_max = 8.000000
mpt_model_load: clip_qkv = 0.000000
mpt_model_load: ftype = 2
mpt_model_load: ggml ctx size = 5653.09 MB
mpt_model_load: kv self size = 1024.00 MB
mpt_model_load: ........................ done
mpt_model_load: model size = 4629.02 MB / num tensors = 194
INFO: connection open
ERROR: The prompt size exceeds the context window size and cannot be processed. GPT-J ERROR: The prompt is 2115 tokens and the context window is 2048!
yes, me too!
It would be great to have n_ctx in the model constructor, though, not in the generate method.
I've been playing around with ggml a bit, trying to implement a growing buffer on the fly, and it is really slow. ggml uses pointers instead of offsets under the hood, which means I cannot just realloc and memcpy memory buffers (the KV cache) for the model.
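For reference, the size of that KV cache is fixed by n_ctx when the model is loaded, which is why the context size naturally belongs to model construction. A small back-of-the-envelope check, assuming f16 keys and values (an assumption that happens to match the "kv self size" line in the load log above):

# Estimate the KV cache size for the mpt-7b load log above.
n_ctx, n_layer, n_embd = 2048, 32, 4096   # values reported by mpt_model_load
bytes_per_element = 2                      # assuming f16 keys and values

kv_cache_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_element   # one K and one V buffer per layer
print(f"kv self size = {kv_cache_bytes / (1024 ** 2):.2f} MB")      # prints 1024.00 MB, matching the log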
It's in the settings by now!
Nevermind me!
Results from today, 2023-10-31 (building from source), for setting different values for n_ctx in the file "llamamodel.cpp". According to some people on Discord, this allows a higher context window size. I use Windows 10 with 32 GB RAM and models loaded on the CPU.
My prompt, which in total consisted of 5652 characters, was an instruction to summarize a long text.
- original value 2048 → new value 16384, model trained for/with 16K context: response loads endlessly long; I force-closed the program. 👎
- original value 2048 → new value 32768, model trained for/with 32K context: response loads endlessly long; I force-closed the program. 👎
- original value 2048 → new value 8192, model trained for/with 16K context: response loads very long, but eventually finishes after a few minutes and gives reasonable output. 👍
- original value 2048 → new value 8192, model (Mistral Instruct) trained for/with 4096 context: response loads very long, but eventually finishes after a few minutes and gives reasonable output. 👍
- original value 2048 → new value 8192, another model presumably trained for/with 4096 context: response loads endlessly long; I force-closed the program. 👎
Here is a typical RAM-usage graph from while the model was generating the response. Notice how RAM gets emptied and then filled again a little later?
I am still experimenting, but I believe that so far, success depends on
- the model you use,
- the values for n_ctx,
- one or multiple other factors, because clearly something gets stuck or is very inefficient.
I would not recommend setting n_ctx to higher values and releasing this new version of gpt4all to the public without extensive testing.
Edit: I have opted to set the context size to 4096 by default, because most models I use are designed for that. E.g. the Mistral models mostly need 4096 and then use advanced techniques to extend that via a sliding window or RoPE scaling, which I get the feeling works without having to set n_ctx, but I have not done extensive testing on that. If somebody does, it might be nice if you could post your findings here.
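For anyone curious, the "rope scaling" mentioned above boils down to compressing position indices so a longer sequence reuses the position range the model was trained on. A toy sketch of the linear-interpolation variant (an illustration of the general idea, not gpt4all's implementation):

def scaled_positions(seq_len, trained_ctx=4096):
    # Compress positions so they never exceed the trained range.
    scale = max(1.0, seq_len / trained_ctx)
    return [i / scale for i in range(seq_len)]

print(scaled_positions(8192)[-1])   # 4095.5, back inside the 0..4095 range the model saw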
Seems like this issue will be fixed by #1668
The OP refers to GPT-J, which is the only model that will not be fixed by #1749.