llama.cpp
Segmentation Fault Error "not enough space in the context's memory pool"
This prompt with the 65B model on an M1 Max 64GB results in a segmentation fault. It works with the 30B model. Are there problems with longer prompts? Related to #12
./main --model ./models/65B/ggml-model-q4_0.bin --prompt "You are a question answering bot that is able to answer questions about the world. You are extremely smart, knowledgeable, capable, and helpful. You always give complete, accurate, and very detailed responses to questions, and never stop a response in mid-sentence or mid-thought. You answer questions in the following format:
Question: What’s the history of bullfighting in Spain?
Answer: Bullfighting, also known as "tauromachia," has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain. However, in recent decades, bullfighting has faced increasing opposition from animal rights activists, and its popularity has declined. Some regions of Spain have banned bullfighting, while others continue to hold bullfights as a cherished tradition. Despite its declining popularity, bullfighting remains an important part of Spanish culture and history, and it continues to be performed in many parts of the country to this day.
Now complete the following questions:
Question: What happened to the field of cybernetics in the 1970s?
Answer: "
Results in
...
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
You are a question answering bot that is able to answer questions about the world. You are extremely smart, knowledgeable, capable, and helpful. You always give complete, accurate, and very detailed responses to questions, and never stop a response in mid-sentence or mid-thought. You answer questions in the following format:
Question: What’s the history of bullfighting in Spain?
Answer: Bullfighting, also known as tauromachia, has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain. However, in recent decades, bullfighting has faced increasing opposition from animal rights activists, and its popularity has declined. Some regions of Spain have banned bullfighting, while others continue to hold bullfights as a cherished tradition. Despite its declining popularity, bullfighting remainsggml_new_tensor_impl: not enough space in the context's memory pool (needed 701660720, available 700585498)
zsh: segmentation fault ./main --model ./models/65B/ggml-model-q4_0.bin --prompt
Are you running out of memory?
I experience this as well, and I always have 5-6 GB of RAM free and around 20 GB of swap when it occurs. It appears to be a known problem with memory allocation, based on ggerganov's comments in #71
potentially fixed by #213
The latest commit b6b268d4415fd3b3e53f22b6619b724d4928f713 gives a segmentation fault right away, without even dropping into the input prompt. This was run on a Mac M1 Max with 64 GB RAM. The crash happened with the 30B LLaMA model but not with 7B. It was working fine even with the 65B model before this commit.
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
Text transcript of a never ending dialog, where User interacts with an AI assistant named ChatLLaMa.
ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer User’s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what User and ChatLLaMa say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 1000ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536987232, available 536870912)
./chatLLaMa: line 53: 99012 Segmentation fault: 11 ./main $GEN_OPTIONS --model "$MODEL" --threads "$N_THREAD" --n_predict "$N_PREDICTS" --color --interactive --reverse-prompt "${USER_NAME}:" --prompt "
@edwios try https://github.com/ggerganov/llama.cpp/commit/404e1da38ec8025707031a8027da14dc1590f952 (the one before https://github.com/ggerganov/llama.cpp/commit/483bab2e3d4a868fe679d8bb32827d2a4df214dc) or try my pr https://github.com/ggerganov/llama.cpp/pull/438 (closed since gg is going to do it differently, but still should work until then)
Last known good commit I have just tested was indeed 404e1da38ec8025707031a8027da14dc1590f952
What error do you get with https://github.com/ggerganov/llama.cpp/commit/483bab2e3d4a868fe679d8bb32827d2a4df214dc ?
Same, ./chatLLaMa: line 53: 99012 Segmentation fault: 11 ./main $GEN_OPTIONS --model "$MODEL" --threads "$N_THREAD" --n_predict "$N_PREDICTS" --color --interactive --reverse-prompt "${USER_NAME}:" --prompt "
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536987232, available 536870912)
Segmentation fault (core dumped)
Just a few batches in (30B q4_1; I just fed it a large file with -f).
Yippy! Commit 2a2e63ce0503d9bf3e55283e40a052c78c1cc3a8 did fix the issue beautifully! Thank you!!
Hi, I am facing the out-of-memory-for-context issue while using the GPT4All 1.3 Groovy model on a machine with 32 CPUs and 512 GB RAM, using CPU inference.
Bumping @eshaanagarwal's comment! I'm facing the same issue.
Did it work for you with commit 2a2e63c and can you narrow down the commit that broke it?
In #1237, I changed some size_t parameters to int, and I'm now worried that may be the culprit. This was done because the dequantize functions already used int for the number of elements.
I am getting the same error: ggml_new_tensor_impl: not enough space in the context's memory pool (needed 20976224, available 12582912). I see that this has been a problem since March 12th.
I am using llama-2-13b-chat.ggmlv3.q3_K_S.bin from TheBloke in Google Cloud Run with 32GB RAM and 8 vCPUs. The service is using LLaMA CPP Python.
I'm quite new to LLaMA-cpp, so excuse any mistakes. This is the relevant part of my script:
# Imports needed for this snippet; APP_NAME, APP_VERSION, MODEL_PATH and the helper
# types (model_from_typed_dict, ChatCompletion, ChatCompletionsRequest, ChatCompletionChunk,
# EventSourceResponse) are defined elsewhere in the script.
from os import cpu_count
from typing import Iterator, Type, Union
from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

app: FastAPI = FastAPI(title=APP_NAME, version=APP_VERSION)
llama: Llama = Llama(model_path=MODEL_PATH, n_ctx=4096, n_batch=2048, n_threads=cpu_count())
response_model: Type[BaseModel] = model_from_typed_dict(ChatCompletion)

# APP FUNCTIONS
@app.post("/api/chat", response_model=response_model)
async def chat(request: ChatCompletionsRequest) -> Union[ChatCompletion, EventSourceResponse]:
    print("Chat-completion request received!")
    completion_or_chunks: Union[ChatCompletion, Iterator[ChatCompletionChunk]] = llama.create_chat_completion(**request.dict(), max_tokens=4096)
    completion: ChatCompletion = completion_or_chunks
    print("Sending completion!")
    return completion
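One possible aggravating factor in this setup, as a hedged observation: with n_ctx=4096, passing max_tokens=4096 on top of a non-empty prompt asks for more tokens than the context window holds. A minimal sketch of a budget check, assuming the same llama object as above and ignoring chat-template overhead (so the estimate is approximate):

# Rough sketch of a token budget, assuming the Llama object above.
# N_CTX mirrors the n_ctx=4096 passed to the constructor; the prompt-token
# count ignores chat-template overhead, so treat the result as approximate.
N_CTX = 4096

def capped_max_tokens(prompt_text: str, requested: int = 4096) -> int:
    prompt_tokens = len(llama.tokenize(prompt_text.encode("utf-8")))
    remaining = max(N_CTX - prompt_tokens, 0)
    return min(requested, remaining)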
I am getting this with Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin, but only when I send in embeddings from vector DB search results. Inference without the retriever works fine, without this issue. I will try the regular Llama 2 and see what happens. It happens when using ConversationalRetrievalChain from langchain.chains.
OK, I was able to make it work by reducing the number of docs to 1; any value above 1 throws the memory access violation.
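For reference, a minimal sketch of what capping the retriever to one document can look like with ConversationalRetrievalChain; vectordb and llm are placeholders for the existing vector store and LlamaCpp instance, which are assumed to be set up elsewhere:

from langchain.chains import ConversationalRetrievalChain

def build_chain(vectordb, llm, k: int = 1) -> ConversationalRetrievalChain:
    # Keep the injected context small: retrieve only the top-k documents.
    retriever = vectordb.as_retriever(search_kwargs={"k": k})
    return ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever)

# Usage (placeholders):
# chain = build_chain(vectordb, llm, k=1)
# result = chain({"question": "what are the risk factors for Tesla?", "chat_history": []})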
It would really help to diagnose this if you are able to reproduce it with one of the examples in this repository. If that's not possible, I would suggest looking into what parameters are being passed to llama_eval. This could happen if n_tokens is higher than n_batch, or if n_tokens + n_past is higher than n_ctx.
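Put differently, the caller has to hold both invariants when feeding tokens to the model. A rough sketch of such a batching loop, with eval_fn standing in for the actual evaluation call (a hypothetical placeholder, not a real llama.cpp binding):

def eval_in_batches(tokens, n_past, n_batch, n_ctx, eval_fn):
    # Refuse prompts that would overflow the context window.
    if n_past + len(tokens) > n_ctx:
        raise ValueError(f"context overflow: {n_past} + {len(tokens)} > n_ctx={n_ctx}")
    # Never pass more than n_batch tokens to a single evaluation call.
    for i in range(0, len(tokens), n_batch):
        chunk = tokens[i:i + n_batch]
        eval_fn(chunk, n_past)
        n_past += len(chunk)
    return n_past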
I think the issue may be because of special characters in the context. This was the context sent to the LLM to generate from.
I debugged it and intercepted the call before this text was sent to the LLM. I copy-pasted it into TextPad to clean out the special characters, and it seemed to be working.
The context is produced from a vector DB containing chunks of Tesla's 10-K filings for the last 4 years. It looks like when the chunking was done, the special characters got into the vector DB, and the LLM was not able to process them.
The prompt that went with the context was "what are the risk factors for Tesla?"
binary_path: F:\ProgramData\Anaconda3\envs\scrapalot-research-assistant\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll
CUDA SETUP: Loading binary F:\ProgramData\Anaconda3\envs\scrapalot-research-assistant\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll...
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
INFO: Started server process [24636]
INFO: Waiting for application startup.
llama.cpp: loading model from ./../llama.cpp/models/Vicuna/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 596.40 MB (+ 2048.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 6106 MB
llama_new_context_with_model: kv self size = 2048.00 MB
AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
UPDATE: After extensive testing, I have come to the conclusion that this is not caused by the special characters; it is caused by the amount of text being sent as context. I can comfortably send about 2000 characters (2 KB) without this memory issue, sometimes even more (I think this depends on how much memory I have free, maybe).
I am using CUDA with an old GPU, an older processor (AVX2 = 0), and 32 GB of memory.
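A rough sketch of that kind of cap, assuming the retrieved chunks arrive as plain strings; the 2000-character limit mirrors the observation above, though a token-based budget (as in the earlier sketch) would be more precise:

MAX_CONTEXT_CHARS = 2000  # rough limit observed above; adjust as needed

def build_prompt(question: str, docs: list) -> str:
    # Concatenate retrieved chunks until the character budget is exhausted.
    context = ""
    for doc in docs:
        if len(context) + len(doc) > MAX_CONTEXT_CHARS:
            break
        context += doc + "\n"
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\nQuestion: {question}\nAnswer:"
    )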
It would really help to diagnose this if you are able to reproduce it with one of the examples in this repository. If that's not possible, I would suggest looking into what parameters are being passed to llama_eval. This could happen if n_tokens is higher than n_batch, or if n_tokens + n_past is higher than n_ctx.
Is there an example where I can send in a context and a prompt?
This issue was closed because it has been inactive for 14 days since being marked as stale.