Prompt is too long
Bug description
It appears to happen as the chat gets longer and longer. How should I address this?
A server error occurred. See below:
RETURN CODE 1
main: seed = 1680564613
llama_model_load: loading model from '/usr/src/app/weights/gpt4all.bin'
main: error: prompt is too long (662 tokens, max 508)
I happen to be using gpt4all.
Steps to reproduce
Just keep chatting. Over time, it will eventually hit this error.
Environment Information
Docker version 20.10.17, build 100c701
Relevant log output
serge-1 | ERROR: serge.utils.generate main: seed = 1680564613
serge-1 | llama_model_load: loading model from '/usr/src/app/weights/gpt4all.bin' - please wait ...
serge-1 | llama_model_load: n_vocab = 32001
serge-1 | llama_model_load: n_ctx = 512
serge-1 | llama_model_load: n_embd = 4096
serge-1 | llama_model_load: n_mult = 256
serge-1 | llama_model_load: n_head = 32
serge-1 | llama_model_load: n_layer = 32
serge-1 | llama_model_load: n_rot = 128
serge-1 | llama_model_load: f16 = 2
serge-1 | llama_model_load: n_ff = 11008
serge-1 | llama_model_load: n_parts = 1
serge-1 | llama_model_load: type = 1
serge-1 | llama_model_load: ggml map size = 4017.70 MB
serge-1 | llama_model_load: ggml ctx size = 81.25 KB
serge-1 | llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
serge-1 | llama_model_load: loading tensors from '/usr/src/app/weights/gpt4all.bin'
serge-1 | llama_model_load: model size = 4017.27 MB / num tensors = 291
serge-1 | llama_init_from_file: kv self size = 256.00 MB
serge-1 |
serge-1 | system_info: n_threads = 32 / 32 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
serge-1 | main: error: prompt is too long (662 tokens, max 508)
Confirmations
- [X] I'm running the latest version of the main branch.
- [X] I checked existing issues to see if this has already been described.
I've looked through the source code of both llama and Serge, and the error appears to be caused by Serge not cropping the conversation history once the combined prompt of all previous messages exceeds the maximum token count defined by the Prompt Context Length setting. It also seems that Serge does not correctly pass the configured max token count through, because every error message shows "llama_model_load: n_ctx = 512" regardless of the setting. I haven't found the root cause for why Serge always sets n_ctx to 512 even if you set it much lower in the settings. The "max 508" in the error output is explained by llama reserving 4 tokens out of the 512-token context.
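For illustration, here is a minimal sketch of the kind of history cropping described above. It assumes the conversation is a list of message strings and uses a placeholder `count_tokens` helper; the names and structure are assumptions for this example, not Serge's actual code.

```python
# Sketch only: names and tokenization are illustrative, not Serge's implementation.

N_CTX = 512          # context window reported by llama_model_load
RESERVED = 4         # tokens llama holds back, hence the "max 508" limit
MAX_PROMPT_TOKENS = N_CTX - RESERVED


def count_tokens(text: str) -> int:
    """Placeholder tokenizer: a real implementation would use the model's
    tokenizer; a rough word count is used here for illustration."""
    return len(text.split())


def crop_history(messages: list[str], max_tokens: int = MAX_PROMPT_TOKENS) -> list[str]:
    """Keep the most recent messages whose combined token count still fits
    the context window, dropping the oldest messages first."""
    kept: list[str] = []
    total = 0
    for message in reversed(messages):   # walk from newest to oldest
        tokens = count_tokens(message)
        if total + tokens > max_tokens:
            break
        kept.append(message)
        total += tokens
    return list(reversed(kept))          # restore chronological order
```

The cropped list would then be joined into the prompt instead of the full history, so the prompt stays under the 508-token limit instead of growing until generation fails.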
I am also experiencing this after some prompts.
Thanks @DVSProductions. I opened PR #138 to address the issue you brought up about the prompt context length not being passed through correctly.
I run into this issue a lot; it's a real party pooper.