
starcoder -- not enough space in the context's memory pool

Open bluecoconut opened this issue 2 years ago • 11 comments

I'm getting errors with starcoder models whenever the prompt contains more than a trivial number of tokens. This happens with both the raw model (direct .bin conversion) and the quantized model, regardless of quantization version (both pre and post the Q4/Q5 format changes).

Relevant error:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 412241472, available 411790368)

Example:

./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml.bin -p "def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test " --top_k 0 --top_p 0.95 --temp 0.2 

will cause the error

main: seed = 1684223471
starcoder_model_load: loading model from '/workspaces/research/models/starcoder/starcoder-ggml.bin'
starcoder_model_load: n_vocab = 49152
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 1
starcoder_model_load: qntvr   = 0
starcoder_model_load: ggml ctx size = 51276.47 MB
starcoder_model_load: memory size = 15360.00 MB, n_mem = 327680
starcoder_model_load: model size  = 35916.23 MB
main: prompt: 'def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test '
main: number of tokens in prompt = 51, first 8 tokens: 589 28176 97 26 28176 97 28176 28176 

def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibggml_new_tensor_impl: not enough space in the context's memory pool (needed 412241472, available 411952576)
Segmentation fault (core dumped)

(Here's another output from the quantized model)

vscode ➜ /workspaces/research/others/ggml (master) $ ./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin -p "def fibo( fibo fib fibo test wate
rfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test " --top_k 0 --top_p 0.95 --temp 0.2 
main: seed = 1684223600
starcoder_model_load: loading model from '/workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49152
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 1003
starcoder_model_load: qntvr   = 1
starcoder_model_load: ggml ctx size = 28956.47 MB
starcoder_model_load: memory size = 15360.00 MB, n_mem = 327680
starcoder_model_load: model size  = 13596.23 MB
main: prompt: 'def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test '
main: number of tokens in prompt = 51, first 8 tokens: 589 28176 97 26 28176 97 28176 28176 

def fibo( fibo fib fibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibo test waterfibggml_new_tensor_impl: not enough space in the context's memory pool (needed 412241472, available 411790368)
Segmentation fault (core dumped)

The closest prior report I can find is https://github.com/ggerganov/llama.cpp/issues/29

Maybe that was fixed for the llama models, but the problem has resurfaced for starcoder?

Based on: https://github.com/ggerganov/ggml/pull/146

I'm specifically hoping that @NouamaneTazi might have some insight into why this is happening.

bluecoconut avatar May 16 '23 07:05 bluecoconut

Interesting find! Thank you for raising this. Two questions:

  • Does this happen with the santacoder model as well, or just starcoder?
  • Can you try using this repo https://github.com/bigcode-project/starcoder.cpp where I used ggml files from the llama.cpp repo?

NouamaneTazi avatar May 16 '23 08:05 NouamaneTazi

Just tried santacoder and it does seem to have the same problem, though at a very different scale (the error is identical). I had to feed in more than 700 tokens, maybe around 1000, before it triggered, so this might just be an ordinary context-length issue?

Example code I used to test santacoder. Note that this doesn't invoke the ggml executable directly but goes through ctransformers; however, the same errors show up as in the original post, where I call the compiled ./starcoder directly, so I think it's safe to say it would behave the same on the underlying ggml:

Python 3.10.11 (main, Apr 12 2023, 14:46:22) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import lambdaprompt as lp
>>> import os
>>> os.environ['LAMBDAPROMPT_BACKEND'] = 'SantaCoderGGML'
>>> comp = lp.Completion("# Some code to print fibonacci numbers\n"*100, max_new_tokens=100)
>>> comp()
Fetching 0 files: 0it [00:00, ?it/s]
Fetching 1 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25575.02it/s]
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268617232, available 268435456)
Segmentation fault (core dumped)

(One other test, with "# Some code to print fibonacci numbers\n"*60, ran successfully on santacoder.)

>>> len(lp.backends.backends['completion'].model.tokenize("# Some code to print fibonacci numbers\n"*60))
720
>>> len(lp.backends.backends['completion'].model.tokenize("# Some code to print fibonacci numbers\n"*100))
1200

I'll try out the starcoder.cpp and raw ggml with santacoder later / when I'm back at my machine.

bluecoconut avatar May 16 '23 09:05 bluecoconut

https://github.com/bigcode-project/starcoder.cpp/issues/3

Seems someone else has run into this on the starcoder.cpp

bluecoconut avatar May 16 '23 20:05 bluecoconut

I tried looking into this but the python script from the example fails to download the model on Mac OS:

 $ ▶ python3 examples/starcoder/convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder
Loading model:  bigcode/gpt_bigcode-santacoder
Traceback (most recent call last):
  File "/Users/ggerganov/development/github/ggml/examples/starcoder/convert-hf-to-ggml.py", line 56, in <module>
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 766, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 473, in __getitem__
    raise KeyError(key)
KeyError: 'gpt_bigcode'

Any ideas how to fix this?

ggerganov avatar May 20 '23 13:05 ggerganov

@ggerganov I think you're on an old version of transformers. Try updating it: pip install -U transformers

NouamaneTazi avatar May 20 '23 14:05 NouamaneTazi

@ggerganov I've been trying to increase the context's memory pool by modifying this part of the code:

        ctx_size += 10 * 1024 * 1024; // TODO: tune this

        printf("%s: ggml ctx size = %6.2f MB\n", __func__, ctx_size/(1024.0*1024.0));

but it doesn't seem to affect ctx->mem_size, because the error message is always the same: ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268637760, available 268435456) (ctx->mem_size = 268435456, where it should be more)

Any idea how to increase ctx->mem_size? Relevant PR

NouamaneTazi avatar May 20 '23 15:05 NouamaneTazi

The problem is in the "eval" context:

https://github.com/ggerganov/ggml/blob/c2fab8a3503b6e6fbf480be993f24c21951d3af0/examples/starcoder/main.cpp#L415-L431

Currently, it starts with a 256 MB buffer that is increased based on N. But this does not take into account n_past, and in general it is a very memory-wasteful approach, since the results of the entire compute graph are stored in this buffer.

Here I tried to improve this using scratch buffers: https://github.com/ggerganov/ggml/pull/176

Please give it a try and let me know if your tests still crash with this version.

ggerganov avatar May 20 '23 15:05 ggerganov

I am observing a similar issue with the Python wrapper llama-cpp-python: https://github.com/abetlen/llama-cpp-python/issues/356

vmajor avatar Jun 10 '23 04:06 vmajor

Hi, I was trying the GPT4All 1.3 groovy model and faced the same issue. I am not able to understand why this is happening. Can anybody provide me with a solution?

eshaanagarwal avatar Jun 13 '23 08:06 eshaanagarwal

@eshaanagarwal the only "solution" that I found was a reboot. Since rebooting is not an option I had to switch to different models. For me all 30B/33B LLM models eventually develop this error when the input context is reaching the upper limit. This does not affect the 65B models. I do not know about any other relationships as this is my use case.

vmajor avatar Jun 13 '23 08:06 vmajor

> @eshaanagarwal the only "solution" that I found was a reboot. Since rebooting is not an option I had to switch to different models. For me all 30B/33B LLM models eventually develop this error when the input context is reaching the upper limit. This does not affect the 65B models. I do not know about any other relationships as this is my use case.

@ggerganov can the memory leak or the issue be fixed ? Or any possible direction as to how to fix it ? Because I really need for this model to work

eshaanagarwal avatar Jun 13 '23 08:06 eshaanagarwal

@eshaanagarwal If you are using the latest version of the starcoder example, the issue should not occur. It was fixed in https://github.com/ggerganov/ggml/pull/176

If the issue does occur, please provide more details about the model you are using, your system information, and the parameters with which you trigger the error.

ggerganov avatar Jun 18 '23 09:06 ggerganov