
ggml_new_tensor_impl: not enough space in the context's memory pool

Open NotNite opened this issue 1 year ago • 3 comments

Heya! A friend showed this to me and I'm trying to get it to work myself on Windows 10. I've applied the changes from #22 to get it to build (more specifically, I pulled in the new commits from etra0's fork), but the actual executable fails to run, printing this before segfaulting:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 458853944, available 454395136)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 458870468, available 454395136)

I'm trying to use 7B on an i9-13900K (and I have about 30 gigs of memory free right now), and I've verified my hashes with a friend. Any ideas? Thanks!
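For context on why this can happen even with plenty of free RAM: ggml manages its own memory. ggml_init reserves a single fixed-size buffer up front, and every tensor is then carved out of that buffer; when the next tensor does not fit, ggml_new_tensor_impl prints the "needed X, available Y" message above. The sketch below is a rough paraphrase of that model (names follow ggml.h, the pool size is illustrative, and this is not the upstream implementation):

#include <stdio.h>
#include "ggml.h"

int main(void) {
    // ggml_init reserves one fixed-size arena; nothing grows on demand.
    struct ggml_init_params params = {
        .mem_size   = 512u * 1024 * 1024, // illustrative 512 MiB pool
        .mem_buffer = NULL,               // let ggml allocate the buffer itself
    };
    struct ggml_context * ctx = ggml_init(params);

    // Every tensor is carved out of that arena. Once the cumulative size of
    // all tensors exceeds mem_size, the allocator reports
    // "not enough space in the context's memory pool (needed X, available Y)".
    struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    printf("allocated a tensor of %zu bytes from the pool\n", (size_t) ggml_nbytes(t));

    ggml_free(ctx);
    return 0;
}

So the numbers in the error track how big the pool was sized at startup, not how much system memory is actually free.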

NotNite avatar Mar 12 '23 01:03 NotNite

Tried out #31; it, uh, got farther: GGML_ASSERT: D:\code\c++\llama.cpp\ggml.c:9349: false

NotNite avatar Mar 12 '23 04:03 NotNite

Ok, I made an oopsie in that PR; initializing it that way apparently didn't zero out the rest of the fields. I updated the branch, please test it again now!

etra0 avatar Mar 12 '23 05:03 etra0

Ok, I made an oopsie in that PR; initializing it that way apparently didn't zero out the rest of the fields. I updated the branch, please test it again now!

It started to expand the prompt, but with seemingly garbage data: Building a website can be done in 10 simple steps: ╨Ñ╤Ç╨╛╨╜╨╛╨╗╨╛╨│╨╕╤ÿ╨

NotNite avatar Mar 12 '23 06:03 NotNite

Should be good on the latest master; reopen if the issue persists. Make sure to rebuild and regenerate the models after updating.

ggerganov avatar Mar 13 '23 17:03 ggerganov

Hey, I was trying to run this on a RHEL 8 server with 32 CPU cores, and I am getting the same error on my second query.

I am using GPT4All-J v1.3-groovy.

ggml_new_tensor_impl: not enough space in the context's memory pool

eshaanagarwal avatar Jun 12 '23 10:06 eshaanagarwal

Hi @ggerganov @gjmulder, I would appreciate some direction on this, please.

eshaanagarwal avatar Jun 13 '23 07:06 eshaanagarwal

Getting the same issue on Apple M1 Pro with 16GB RAM when trying the example from:

https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain/blob/master/06.private-gpt4all-qa-pdf.ipynb

Using a relatively large PDF with ~200 pages

Stack trace:

gpt_tokenize: unknown token '?'
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16118890208, available 16072355200)
[1]    62734 segmentation fault  python3
/opt/homebrew/Cellar/[email protected]/3.11.4/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
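Two things stand out here. First, the reserved pool (about 16.07 GB) is already close to everything a 16 GB machine has, and pushing a ~200-page PDF through the chain asked for slightly more than that; splitting the document into smaller chunks, or lowering the requested context size, should shrink what the pool needs to hold. Second, the segmentation fault immediately after the "not enough space" message is consistent with the failed allocation handing back a null pointer that then gets dereferenced downstream, though that is an inference, not something confirmed from the code. A hypothetical defensive pattern (not the upstream implementation) would look like:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include "ggml.h"

// Check the result of a ggml allocation so pool exhaustion fails with a clear
// message instead of a null-pointer crash somewhere downstream.
static struct ggml_tensor * new_f32_or_die(struct ggml_context * ctx, int64_t n) {
    struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    if (t == NULL) {
        fprintf(stderr, "ggml pool exhausted while allocating %lld floats; "
                        "increase the pool size or shrink the context\n", (long long) n);
        exit(EXIT_FAILURE);
    }
    return t;
}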

superbsky avatar Jun 18 '23 01:06 superbsky

Same issue when running on Win11 with 64GB RAM (25 GB utilized):

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 450887680, available 446693376)
Traceback (most recent call last):
  File "C:\AI\oobabooga_windows_GPU\text-generation-webui\modules\callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "C:\AI\oobabooga_windows_GPU\text-generation-webui\modules\llamacpp_model.py", line 92, in generate
    for completion_chunk in completion_chunks:
  File "C:\AI\oobabooga_windows_GPU\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 891, in _create_completion
    for token in self.generate(
  File "C:\AI\oobabooga_windows_GPU\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 713, in generate
    self.eval(tokens)
  File "C:\AI\oobabooga_windows_GPU\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 453, in eval
    return_code = llama_cpp.llama_eval(
  File "C:\AI\oobabooga_windows_GPU\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 612, in llama_eval
    return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation reading 0x0000000000000028
Output generated in 39.00 seconds (0.00 tokens/s, 0 tokens, context 5200, seed 1177762893)
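If I understand the allocation correctly, the scratch and eval buffers are sized when the model is loaded, based on the model type and the n_ctx/n_batch that were requested, so a run that later feeds in more context than they were sized for (the output above reports "context 5200") can overflow them, and the access violation at 0x0000000000000028 again looks like a downstream dereference of a failed allocation. In llama-cpp-python the n_ctx and n_batch arguments map onto the underlying llama_context_params. A minimal C sketch of where those knobs live, assuming the mid-2023 llama.h API (treat the exact function names as approximate for your version):

#include <stdio.h>
#include "llama.h"

int main(int argc, char ** argv) {
    const char * model_path = argc > 1 ? argv[1] : "models/7B/ggml-model-q4_0.bin";

    // Size the context for the longest prompt you actually intend to feed it.
    struct llama_context_params params = llama_context_default_params();
    params.n_ctx   = 2048; // KV cache and eval buffers are sized from this
    params.n_batch = 512;  // larger batches need larger scratch buffers

    struct llama_model * model = llama_load_model_from_file(model_path, params);
    if (model == NULL) {
        fprintf(stderr, "failed to load model: %s\n", model_path);
        return 1;
    }

    struct llama_context * ctx = llama_new_context_with_model(model, params);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create llama context\n");
        llama_free_model(model);
        return 1;
    }

    // ... tokenize and evaluate prompts no longer than n_ctx here ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}

Keeping the prompt plus generated tokens within the n_ctx the context was created with (or raising n_ctx and accepting the larger buffers) avoids asking pools that were sized for less to hold more.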

dzupin avatar Jul 18 '23 13:07 dzupin

Same issue when running on Win11 with 64GB RAM (25 GB utilized): [snip]

Oh hey, exact same error:

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 452859040, available 446693376)

LoganDark avatar Jul 25 '23 01:07 LoganDark

Same issue here; I've tried a combination of settings but just keep getting the memory error, even though both RAM and GPU VRAM are at less than 50% utilization.

I had to follow the guide here to build llama-cpp with GPU support, as it wasn't working previously, but even before that it was giving the same error (side note: GPU support natively does work in oobabooga on Windows!?):
https://github.com/abetlen/llama-cpp-python/issues/182

Anyone have any ideas?

HW: Intel i9-10900K OC @ 5.3 GHz, 64 GB DDR4-2400 / PC4-19200, 12 GB Nvidia GeForce RTX 3060

Using embedded DuckDB with persistence: data will be stored in: db
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama.cpp: loading model from models/llama7b/llama-deus-7b-v3.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2927.79 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1470 MB
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

What would you like to know about the policies?

test

ggml_new_object: not enough space in the context's memory pool (needed 10882896, available 10650320)
Traceback (most recent call last):
  File "H:\AI_Projects\Indexer_Plus_GPT\chat.py", line 84, in <module>
    main()
  File "H:\AI_Projects\Indexer_Plus_GPT\chat.py", line 55, in main
    res = qa(query)
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in __call__
    raise e
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\retrieval_qa\base.py", line 133, in _call
    answer = self.combine_documents_chain.run(
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 441, in run
    return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in __call__
    raise e
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\combine_documents\base.py", line 106, in _call
    output, extra_return_dict = self.combine_docs(
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\combine_documents\stuff.py", line 165, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 252, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 243, in __call__
    raise e
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\base.py", line 237, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 92, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "C:\Program Files\Python310\lib\site-packages\langchain\chains\llm.py", line 102, in generate
    return self.llm.generate_prompt(
  File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 188, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 281, in generate
    output = self._generate_helper(
  File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 225, in _generate_helper
    raise e
  File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 212, in _generate_helper
    self._generate(
  File "C:\Program Files\Python310\lib\site-packages\langchain\llms\base.py", line 604, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\langchain\llms\llamacpp.py", line 229, in _call
    for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
  File "C:\Program Files\Python310\lib\site-packages\langchain\llms\llamacpp.py", line 279, in stream
    for chunk in result:
  File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 899, in _create_completion
    for token in self.generate(
  File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 721, in generate
    self.eval(tokens)
  File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama.py", line 461, in eval
    return_code = llama_cpp.llama_eval(
  File "C:\Program Files\Python310\lib\site-packages\llama_cpp\llama_cpp.py", line 678, in llama_eval
    return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation reading 0x0000000000000000
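As a sanity check on the sizes involved: the "kv self size = 1024.00 MB" line in the load log above follows directly from the hyperparameters it prints, assuming an f16 K and V cache (2 bytes per element). A small back-of-the-envelope sketch:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t n_layer = 32;          // from: llama_model_load_internal: n_layer = 32
    const uint64_t n_ctx   = 2048;        // from: llama_model_load_internal: n_ctx = 2048
    const uint64_t n_embd  = 4096;        // from: llama_model_load_internal: n_embd = 4096
    const uint64_t bytes_per_element = 2; // assumed f16 KV cache

    // K and V each hold n_ctx x n_embd elements per layer.
    const uint64_t kv_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_element;
    printf("kv self size = %.2f MB\n", kv_bytes / (1024.0 * 1024.0));
    // Prints: kv self size = 1024.00 MB
    return 0;
}

The KV cache, scratch buffers, and eval pools are all reserved up front from numbers like these, which is why the "not enough space" errors can appear while overall RAM and VRAM utilization still look low.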

omarelanis avatar Jul 26 '23 15:07 omarelanis

Same here... any solutions already???

jiapei100 avatar Aug 07 '23 21:08 jiapei100

Solved this by going back to llama-cpp-python version 0.1.74

sherrmann avatar Sep 02 '23 22:09 sherrmann

Solved this by going back to llama-cpp-python version 0.1.74

Well, this has nothing to do with Python.

LoganDark avatar Sep 02 '23 22:09 LoganDark

Same here... any solutions already???

dereklll avatar Sep 12 '23 05:09 dereklll

@dereklll This issue was closed 6 months ago; I'd suggest creating a new one.

sozforex avatar Sep 12 '23 12:09 sozforex

Same issue on a RunPod GPU machine; tried two different GPUs.

dillfrescott avatar Nov 15 '23 05:11 dillfrescott