
Allow custom prompt limit (n_ctx=2048)

jeffochoa opened this issue 1 year ago · 12 comments

Feature request

Currently there is a limit on the number of tokens that can be used in the prompt:

GPT-J ERROR: The prompt is 9884 tokens and the context window is 2048!

The error is produced in GPTJ::prompt(). Here, it looks like the n_ctx that arrives from the frontend is not used; instead, the value comes from the model itself. As such, setting the value yourself won't really matter. (see more)

https://github.com/nomic-ai/gpt4all/blob/8204c2eb806aeab055b7a7fae4b4adc02e34ef41/gpt4all-backend/gptj.cpp#L920
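
For illustration, here is a minimal C++ sketch of the pattern described above (not the actual gpt4all source; all names are invented): the n_ctx requested by the caller is discarded in favor of the value parsed from the model file.

```cpp
#include <cstdio>

// Hypothetical stand-in for the hyperparameters parsed out of a model file.
struct ModelHparams {
    int n_ctx = 2048;  // context size baked into the weights at training time
};

// The caller's n_ctx is accepted but never used: the value from the model
// file wins, so changing the setting in the frontend has no effect.
int effective_n_ctx(int requested_n_ctx, const ModelHparams& hparams) {
    (void)requested_n_ctx;  // silently ignored
    return hparams.n_ctx;
}

int main() {
    ModelHparams hparams;  // pretend this was read from the .bin file
    std::printf("requested 4096 -> effective n_ctx = %d\n",
                effective_n_ctx(4096, hparams));
    return 0;
}
```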

Motivation

Being able to customise the prompt input limit could allow developers to build more complete plugins to interact with the model, using a more useful context and longer conversation history.

For example, right now it is almost impossible to build a plugin that browses the web, as you can't use a page's content (HTML) as part of the context because it can easily exceed the input limit.

Your contribution

.

jeffochoa avatar May 21 '23 19:05 jeffochoa

This is more a limitation of the model's context window. The model was only trained with a context window of 2048 tokens, so exceeding that isn't really possible at the moment with the existing models.

zanussbaum avatar May 22 '23 00:05 zanussbaum

The Mosaic models have a much bigger context window; even their base models are built to exceed smaller context windows: https://www.mosaicml.com/blog/mpt-7b

menelic avatar May 22 '23 09:05 menelic

The Mosaic models have a much bigger context window; even their base models are built to exceed smaller context windows: https://www.mosaicml.com/blog/mpt-7b

Interesting. Have you been able to use one of those models with the GPT4All library?

jeffochoa avatar May 22 '23 13:05 jeffochoa

That's correct: Mosaic models have a context length of up to 4096 for the models that have been ported to GPT4All. However, GPT-J models are still limited to a 2048-token prompt length, so using more tokens will not work well.

zanussbaum avatar May 22 '23 13:05 zanussbaum

I used the mpt-7b-chat model and specified n_ctx=4096 but still got the error:

# Assuming the LangChain GPT4All wrapper; stream_manager is a callback manager defined elsewhere.
from langchain.llms import GPT4All

llm = GPT4All(model='../models/ggml-mpt-7b-chat.bin',
              verbose=False,
              temp=0,
              top_p=0.95,
              top_k=40,
              repeat_penalty=1.1,
              n_ctx=4096,
              callback_manager=stream_manager)

Error log:

Found model file.
mpt_model_load: loading model from '../models/ggml-mpt-7b-chat.bin' - please wait ...
mpt_model_load: n_vocab        = 50432
mpt_model_load: n_ctx          = 2048
mpt_model_load: n_embd         = 4096
mpt_model_load: n_head         = 32
mpt_model_load: n_layer        = 32
mpt_model_load: alibi_bias_max = 8.000000
mpt_model_load: clip_qkv       = 0.000000
mpt_model_load: ftype          = 2
mpt_model_load: ggml ctx size = 5653.09 MB
mpt_model_load: kv self size  = 1024.00 MB
mpt_model_load: ........................ done
mpt_model_load: model size =  4629.02 MB / num tensors = 194
INFO:     connection open
ERROR: The prompt size exceeds the context window size and cannot be processed. GPT-J ERROR: The prompt is 2115 tokens and the context window is 2048!
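
Until larger windows are supported, a possible client-side stopgap is to trim the prompt before sending it. A rough sketch (the ~4 characters-per-token ratio is just a heuristic; a real tokenizer would give exact counts):

```cpp
#include <cstddef>
#include <string>

// Trim a prompt so that the prompt tokens plus a reserve for the reply stay
// within the model's context window. Purely illustrative: assumes roughly
// 4 characters per token, which is only a ballpark figure for English text.
std::string truncate_prompt(const std::string& prompt,
                            int n_ctx = 2048,    // model's context window
                            int reserve = 256) { // tokens left for the reply
    const std::size_t max_chars =
        static_cast<std::size_t>(n_ctx - reserve) * 4;
    return prompt.size() <= max_chars ? prompt : prompt.substr(0, max_chars);
}
```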

jpzhangvincent avatar May 29 '23 02:05 jpzhangvincent

I used the mpt-7b-chat model and specified n_ctx=4096 but still got the error

yes, me too!

crixue avatar May 31 '23 03:05 crixue

It would be great to have n_ctx in the model constructor, not in the generate method though.

I've been playing around with ggml a bit, trying to implement a growing buffer on the fly, and it is really slow. ggml uses pointers instead of offsets under the hood, which means I cannot just realloc and memcpy the memory buffers (the KV cache) for the model.
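
A minimal sketch of why this is hard: realloc is free to move the allocation, so absolute pointers into the old buffer dangle, while offsets stay valid against the new base.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    // A stand-in for a tensor buffer such as the KV cache.
    char* buf = static_cast<char*>(std::malloc(16));
    std::strcpy(buf, "kv-cache");

    char*       as_pointer = buf + 3;  // absolute pointer into the buffer
    std::size_t as_offset  = 3;        // the same location as an offset

    // Grow the buffer. realloc is free to move the whole block...
    char* grown = static_cast<char*>(std::realloc(buf, 1 << 20));

    // ...so `as_pointer` may now dangle and must not be dereferenced,
    // while the offset is still valid relative to the new base address.
    std::printf("via offset: %s\n", grown + as_offset);

    std::free(grown);
    return 0;
}
```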

Chae4ek avatar Jul 03 '23 21:07 Chae4ek

It's in the settings now!

niansa avatar Aug 11 '23 13:08 niansa

Nevermind me!

niansa avatar Aug 11 '23 14:08 niansa

Results from today, 2023-10-31 (building from source), of setting different values for n_ctx in "llamamodel.cpp". According to some people on Discord, this allows a larger context window. I am on Windows 10 with 32 GB RAM and load models on the CPU.

My prompt, which in total consisted of 5652 characters, was an instruction to summarize a long text.

- Original value 2048 → new value 16384, model trained for/with 16K context: response generates endlessly; I force-closed the program. 👎

- Original value 2048 → new value 32768, model trained for/with 32K context: response generates endlessly; I force-closed the program. 👎

- Original value 2048 → new value 8192, model trained for/with 16K context: response takes very long, but eventually finishes after a few minutes and gives reasonable output. 👍

- Original value 2048 → new value 8192, model (Mistral Instruct) trained for/with 4096 context: response takes very long, but eventually finishes after a few minutes and gives reasonable output. 👍

- Original value 2048 → new value 8192, another model presumably trained for/with 4096 context: response generates endlessly; I force-closed the program. 👎 Here is a typical graph from while the model was generating the response; notice how RAM gets emptied and then refilled a little later. [image: RAM usage graph during generation]

I am still experimenting, but so far I believe success depends on:

  1. the model you use.
  2. values for n_ctx
  3. one or multiple other factors, because clearly something gets stuck or is very inefficient.

I would not recommend setting n_ctx to higher values and releasing this new version of gpt4all to the public without extensive testing.

Edit: I have opted to set the context size to 4096 by default, because most models I use are designed for that. E.g. the Mistral models mostly need 4096 and then use advanced techniques such as sliding-window attention or RoPE scaling to extend it, which I get the feeling works without having to set n_ctx, but I have not done extensive testing on that. If somebody does, it would be nice if you could post your findings here.

ThiloteE avatar Oct 31 '23 00:10 ThiloteE

Seems like this issue will be fixed by #1668

ThiloteE avatar Nov 28 '23 10:11 ThiloteE

Seems like this issue will be fixed by #1668

The OP refers to GPT-J, which is the only model that will not be fixed by #1749.

cebtenzzre avatar Dec 14 '23 17:12 cebtenzzre