Segmentation fault (core dumped) appearing randomly
Description:
I'm experiencing random assertion failures and segmentation faults when streaming responses from a fine-tuned Llama3.1 70B GGUF model. The error occurs in the GGML matrix multiplication validation.
Sometimes it gives this GGML error, but most of the time it just prints Segmentation fault (core dumped) and my pipeline crashes.
Environment:
- llama_cpp_python version: 0.3.4
- GPU: NVIDIA A40
- Model: Custom fine-tuned Llama3.1 70B GGUF (originally fine-tuned with Unsloth at 4k context, running at 16k n_ctx)
- OS: Ubuntu
- Python version: 3.11
Error Log:
llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml.c:3513: GGML_ASSERT(a->ne[2] == b->ne[0]) failed
Segmentation fault (core dumped)
Reproduction Steps:
- Load fine-tuned 70B GGUF model with:

```python
llm = Llama(
    model_path="llama3.1_70B_finetuned.Q4_K_M.gguf",
    n_ctx=16384,
    n_gpu_layers=-1,
    logits_all=True,
)
```

- Start streaming chat completion:

```python
for chunk in llm.create_chat_completion(
    messages=[...],
    stream=True,
    max_tokens=1000,
):
    print(chunk)
```

- Error occurs randomly during streaming (usually after several successful chunks)
Additional Context:
- The model was fine-tuned using Unsloth with 4k context length
- Converted to GGUF using llama.cpp's convert script
- Works fine for non-streaming inference
- Error appears more frequently with longer context (>8k tokens)
- Memory usage appears normal before crash (~80GB GPU mem for 70B Q4_K_M)
Debugging Attempts:
- Tried different n_ctx values (4096, 8192, 16384)
- Verified model integrity with llama.cpp's main example
- Added thread locking around model access (no effect; see the sketch below)
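For reference, the locking attempt looked roughly like the following (a minimal sketch; it reuses the `llm` instance and the shortened generation arguments from the reproduction snippet above):

```python
import threading

llm_lock = threading.Lock()  # serialize all access to the single Llama instance

def stream_completion(messages):
    # Hold the lock for the entire streamed generation. Note the lock stays
    # held until the generator is fully consumed or closed, so only one
    # generation can be in flight at a time.
    with llm_lock:
        for chunk in llm.create_chat_completion(
            messages=messages,
            stream=True,
            max_tokens=1000,
        ):
            yield chunk
```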
System Info:
CUDA 12.2
Python 3.11
Request: Could you help investigate:
- Potential causes for the GGML tensor dimension mismatch
- Whether this relates to the context length difference between fine-tuning (4k) and inference (16k) (see the sketch after this list)
- Any known issues with streaming large (70B) models
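To make the second point concrete, one way to compare the model's trained context length against the runtime n_ctx is to read the GGUF metadata that llama-cpp-python exposes. This is only a sketch: the `llm.metadata` dict and the "llama.context_length" key are assumptions about how the metadata is surfaced for this particular model.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama3.1_70B_finetuned.Q4_K_M.gguf",
    n_ctx=16384,
    n_gpu_layers=-1,
)

# GGUF metadata is a str -> str mapping; the context-length key is
# architecture-specific ("llama.context_length" for Llama-family models).
print("trained context length:", llm.metadata.get("llama.context_length"))
print("runtime n_ctx:", llm.n_ctx())
```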
Is it reproducible with the latest version (0.3.8)?
@shamitv It is possible, but so far I've only tested it with 0.2.9 and 0.3.4. I just want to know what could be causing this error. I've been using llama-cpp-python in many projects and for a long time, but it only occurs in one project where I am getting the output in a stream and calling the model again and again very fast (my use case is to get output from Llama 70B as quickly as possible).
I remember having these kinds of issues in my early llama.cpp days. The reason back then was my previous CPU (not GPU!), which could not handle certain instruction sets (mainly AVX2), which led to segmentation faults. Having a more modern CPU solved the problems for me. Maybe this helps, maybe not.
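If you want to rule this out quickly, something like the following works on Linux (a rough sketch that just scans the CPU flag list reported by the kernel):

```python
# Rough check on Linux: which SIMD flags does this CPU report?
with open("/proc/cpuinfo") as f:
    cpu_flags = set(f.read().split())

for isa in ("avx", "avx2", "avx512f", "fma"):
    print(isa, "present" if isa in cpu_flags else "MISSING")
```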
This is most probably a memory overflow issue. Since your output is streamed and the model is called repeatedly, your memory and compute usage will be quite high; try with a higher-capacity CPU and more RAM.
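One way to check whether host memory is actually growing before the crash is to log the process's peak RSS while consuming the stream. This is a standard-library sketch reusing the `llm` instance from the reproduction snippet; the messages list is just a placeholder prompt.

```python
import resource

def log_peak_rss(tag):
    # ru_maxrss is the peak resident set size of this process (KiB on Linux).
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{tag}] peak RSS: {peak_kib / 1024:.0f} MiB")

log_peak_rss("before call")
for i, chunk in enumerate(
    llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello"}],  # placeholder prompt
        stream=True,
        max_tokens=1000,
    )
):
    if i % 50 == 0:
        log_peak_rss(f"chunk {i}")
log_peak_rss("after call")
```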