
Trying to load llm model using llama cpp python with GPU support fails with an OSError: exception: access violation reading 0x0000000000000000

Open Sanjit0910 opened this issue 1 year ago • 21 comments

Description

When attempting to set up llama-cpp-python with GPU support using the CUDA Toolkit, following the documented steps, initialization of the llama-cpp model fails with an access violation error.

Steps to Reproduce

  1. Install CUDA Toolkit v12.4

  2. Set up environment variables:

set CMAKE_ARGS="-DGGML_CUDA=on" set FORCE_CMAKE=1

  3. Uninstall and reinstall llama-cpp-python (pinning numpy==1.26.4 to avoid other dependency issues):

    poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python "numpy==1.26.4"

  4. The installation succeeds and the GPU is detected, but model loading fails with the following error:

    llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\stormy101\Documents\private-gpt\models\mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv 0: general.architecture str = llama
    llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
    llama_model_loader: - kv 2: llama.context_length u32 = 32768
    llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
    llama_model_loader: - kv 4: llama.block_count u32 = 32
    llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
    llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
    llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
    llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
    llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
    llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
    llama_model_loader: - kv 11: general.file_type u32 = 15
    llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
    llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
    llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
    llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
    llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
    llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
    llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
    llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
    llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
    llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
    llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
    llama_model_loader: - kv 23: general.quantization_version u32 = 2
    llama_model_loader: - type f32: 65 tensors
    llama_model_loader: - type q4_K: 193 tensors
    llama_model_loader: - type q6_K: 33 tensors
    llm_load_vocab: special tokens cache size = 259
    llm_load_vocab: token to piece cache size = 0.1637 MB
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = llama
    llm_load_print_meta: vocab type = SPM
    llm_load_print_meta: n_vocab = 32000
    llm_load_print_meta: n_merges = 0
    llm_load_print_meta: n_ctx_train = 32768
    llm_load_print_meta: n_embd = 4096
    llm_load_print_meta: n_head = 32
    llm_load_print_meta: n_head_kv = 8
    llm_load_print_meta: n_layer = 32
    llm_load_print_meta: n_rot = 128
    llm_load_print_meta: n_embd_head_k = 128
    llm_load_print_meta: n_embd_head_v = 128
    llm_load_print_meta: n_gqa = 4
    llm_load_print_meta: n_embd_k_gqa = 1024
    llm_load_print_meta: n_embd_v_gqa = 1024
    llm_load_print_meta: f_norm_eps = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps = 1.0e-05
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: f_logit_scale = 0.0e+00
    llm_load_print_meta: n_ff = 14336
    llm_load_print_meta: n_expert = 0
    llm_load_print_meta: n_expert_used = 0
    llm_load_print_meta: causal attn = 1
    llm_load_print_meta: pooling type = 0
    llm_load_print_meta: rope type = 0
    llm_load_print_meta: rope scaling = linear
    llm_load_print_meta: freq_base_train = 1000000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_ctx_orig_yarn = 32768
    llm_load_print_meta: rope_finetuned = unknown
    llm_load_print_meta: ssm_d_conv = 0
    llm_load_print_meta: ssm_d_inner = 0
    llm_load_print_meta: ssm_d_state = 0
    llm_load_print_meta: ssm_dt_rank = 0
    llm_load_print_meta: model type = 7B
    llm_load_print_meta: model ftype = Q4_K - Medium
    llm_load_print_meta: model params = 7.24 B
    llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
    llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
    llm_load_print_meta: BOS token = 1 '<s>'
    llm_load_print_meta: EOS token = 2 '</s>'
    llm_load_print_meta: UNK token = 0 '<unk>'
    llm_load_print_meta: PAD token = 0 '<unk>'
    llm_load_print_meta: LF token = 13 '<0x0A>'
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
      Device 0: NVIDIA RTX 2000 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
    Traceback (most recent call last):
      File "C:\Users\stormy101\Documents\private-gpt\private_gpt\components\llm\llm_component.py", line 57, in __init__
        self.llm = LlamaCPP(
                   ^^^^^^^^^
      File "C:\Users\stormy101\AppData\Local\anaconda3\envs\privategpt\Lib\site-packages\llama_index\llms\llama_cpp\base.py", line 109, in __init__
        self._model = Llama(model_path=model_path, **model_kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\stormy101\AppData\Local\anaconda3\envs\privategpt\Lib\site-packages\llama_cpp\llama.py", line 349, in __init__
        self._model = _LlamaModel(
                      ^^^^^^^^^^^^
      File "C:\Users\stormy101\AppData\Local\anaconda3\envs\privategpt\Lib\site-packages\llama_cpp\_internals.py", line 52, in __init__
        self.model = llama_cpp.llama_load_model_from_file(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    OSError: exception: access violation reading 0x0000000000000000

Environment Details:

  1. Python: 3.11.9
  2. CUDA Toolkit Version: CUDA 12.4
  3. OS: Windows 11

I am able to load the model on the CPU. I tried downgrading to 0.2.78, but the error persists. I need your help to resolve the issue. Thank you.
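For anyone who wants to reproduce this outside of private-gpt, here is a minimal sketch (untested; the path and parameters are just examples from my setup) that exercises the same load path directly through llama-cpp-python and first checks whether the installed wheel was built with GPU offload support at all:

```python
import llama_cpp
from llama_cpp import Llama

# Confirm the installed wheel was compiled with a GPU backend.
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())

# Load the same GGUF file directly, bypassing private-gpt / llama-index.
llm = Llama(
    model_path=r"C:\Users\stormy101\Documents\private-gpt\models\mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all layers; the crash happens during this load
    verbose=True,
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```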

Sanjit0910 avatar Jul 08 '24 19:07 Sanjit0910

I have the same issue.

Description

When attempting to set up llama-cpp-python with GPU support using the CUDA Toolkit, following the documented steps, initialization of the llama-cpp model fails with an access violation error.

Steps to Reproduce

Install CUDA Toolkit v12.5

Create conda environment and install llama_cpp:

    conda create -n llama_clean
    conda activate llama_clean
    conda install pip
    set CMAKE_ARGS=-DGGML_CUDA=on
    set FORCE_CMAKE=1
    cd C:\Users\User\anaconda3\envs\llama_clean
    Scripts\pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Start Python and try to load the model:

    from llama_cpp import Llama
    model = Llama(model_path="models/generator/Mistral-7B-Instruct-v0.3.Q6_K.gguf", n_ctx=2048, n_gpu_layers=999, embedding=False)
    llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from models/generator/Mistral-7B-Instruct-v0.3.Q6_K.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv 0: general.architecture str = llama
    llama_model_loader: - kv 1: general.name str = models--mistralai--Mistral-7B-Instruc...
    llama_model_loader: - kv 2: llama.block_count u32 = 32
    llama_model_loader: - kv 3: llama.context_length u32 = 32768
    llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
    llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
    llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
    llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
    llama_model_loader: - kv 8: llama.rope.freq_base f32 = 1000000.000000
    llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
    llama_model_loader: - kv 10: general.file_type u32 = 18
    llama_model_loader: - kv 11: llama.vocab_size u32 = 32768
    llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
    llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
    llama_model_loader: - kv 14: tokenizer.ggml.pre str = default
    llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
    llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32768] = [0.000000, 0.000000, 0.000000, 0.0000...
    llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32768] = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
    llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
    llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
    llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
    llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
    llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
    llama_model_loader: - kv 23: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
    llama_model_loader: - kv 24: general.quantization_version u32 = 2
    llama_model_loader: - kv 25: quantize.imatrix.file str = ./imatrix.dat
    llama_model_loader: - kv 26: quantize.imatrix.dataset str = group_40.txt
    llama_model_loader: - kv 27: quantize.imatrix.entries_count i32 = 224
    llama_model_loader: - kv 28: quantize.imatrix.chunks_count i32 = 74
    llama_model_loader: - type f32: 65 tensors
    llama_model_loader: - type q6_K: 226 tensors
    llm_load_vocab: special tokens cache size = 771
    llm_load_vocab: token to piece cache size = 0.1731 MB
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = llama
    llm_load_print_meta: vocab type = SPM
    llm_load_print_meta: n_vocab = 32768
    llm_load_print_meta: n_merges = 0
    llm_load_print_meta: vocab_only = 0
    llm_load_print_meta: n_ctx_train = 32768
    llm_load_print_meta: n_embd = 4096
    llm_load_print_meta: n_layer = 32
    llm_load_print_meta: n_head = 32
    llm_load_print_meta: n_head_kv = 8
    llm_load_print_meta: n_rot = 128
    llm_load_print_meta: n_swa = 0
    llm_load_print_meta: n_embd_head_k = 128
    llm_load_print_meta: n_embd_head_v = 128
    llm_load_print_meta: n_gqa = 4
    llm_load_print_meta: n_embd_k_gqa = 1024
    llm_load_print_meta: n_embd_v_gqa = 1024
    llm_load_print_meta: f_norm_eps = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps = 1.0e-05
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: f_logit_scale = 0.0e+00
    llm_load_print_meta: n_ff = 14336
    llm_load_print_meta: n_expert = 0
    llm_load_print_meta: n_expert_used = 0
    llm_load_print_meta: causal attn = 1
    llm_load_print_meta: pooling type = 0
    llm_load_print_meta: rope type = 0
    llm_load_print_meta: rope scaling = linear
    llm_load_print_meta: freq_base_train = 1000000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_ctx_orig_yarn = 32768
    llm_load_print_meta: rope_finetuned = unknown
    llm_load_print_meta: ssm_d_conv = 0
    llm_load_print_meta: ssm_d_inner = 0
    llm_load_print_meta: ssm_d_state = 0
    llm_load_print_meta: ssm_dt_rank = 0
    llm_load_print_meta: model type = 7B
    llm_load_print_meta: model ftype = Q6_K
    llm_load_print_meta: model params = 7.25 B
    llm_load_print_meta: model size = 5.54 GiB (6.56 BPW)
    llm_load_print_meta: general.name = models--mistralai--Mistral-7B-Instruct-v0.3
    llm_load_print_meta: BOS token = 1 '<s>'
    llm_load_print_meta: EOS token = 2 '</s>'
    llm_load_print_meta: UNK token = 0 '<unk>'
    llm_load_print_meta: LF token = 781 '<0x0A>'
    llm_load_print_meta: max token length = 48
    Exception ignored in: <function Llama.__del__ at 0x00000176EE09DC60>
    Traceback (most recent call last):
      File "C:\Users\User\anaconda3\envs\llama_clean\Lib\site-packages\llama_cpp\llama.py", line 2089, in __del__
        if self._lora_adapter is not None:
           ^^^^^^^^^^^^^^^^^^
    AttributeError: 'Llama' object has no attribute '_lora_adapter'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\User\anaconda3\envs\llama_clean\Lib\site-packages\llama_cpp\llama.py", line 372, in __init__
        _LlamaModel(
      File "C:\Users\User\anaconda3\envs\llama_clean\Lib\site-packages\llama_cpp\_internals.py", line 50, in __init__
        self.model = llama_cpp.llama_load_model_from_file(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    OSError: exception: access violation reading 0x0000000000000000

Environment Details:

  1. Python: 3.12.4
  2. CUDA Toolkit Version: CUDA 12.5
  3. OS: Windows 10

Additional

When installing without CUDA there is no problem. Using n_gpu_layers=0 with a CUDA installation does not solve the issue. The issue is independent of the model used.

Asgir avatar Jul 23 '24 10:07 Asgir

An update that may help in narrowing this down:

Under Windows 11:

  • building llama.cpp with cmake and then installing llama_cpp_python with linked library still causes the issue.
  • calling llama-cli (with llama.cpp built from previous step) works fine.

So I guess the problem is either in the Python bindings or in llama.dll, but in principle it should be able to work. Does someone maybe have a minimal Python-bindings snippet just for loading the model? It would be useful for debugging.
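For reference, a minimal loader sketch using the low-level ctypes bindings that ship with llama-cpp-python (untested; the function names are the ones visible in the tracebacks above, and the model path is a placeholder):

```python
import llama_cpp

# Initialise the backend, build default model params, and call the same
# C function that crashes in the tracebacks above.
llama_cpp.llama_backend_init()

params = llama_cpp.llama_model_default_params()
params.n_gpu_layers = 99  # offload everything, as in the failing runs

model = llama_cpp.llama_load_model_from_file(
    b"models/Mistral-7B-Instruct-v0.3.Q6_K.gguf",  # placeholder path, must be bytes
    params,
)
if not model:
    raise RuntimeError("llama_load_model_from_file returned NULL")

print("model loaded")
llama_cpp.llama_free_model(model)
llama_cpp.llama_backend_free()
```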

Under WSL:

  • building llama.cpp with cmake and then installing llama_cpp_python with linked library works fine.
  • calling llama-cli (with llama.cpp built from previous step) works fine.

So under WSL everything works fine. Maybe a (rather cumbersome) workaround if Windows does not work.

Asgir avatar Jul 25 '24 10:07 Asgir

> (quoted the original report above in full)

Yep, same here. Same version, CUDA version, and OS.

Devnant avatar Jul 26 '24 02:07 Devnant

Same issue here, with Windows 10 and Vulkan backend.

stduhpf avatar Jul 28 '24 18:07 stduhpf

Same here 🤣 I'd been stuck on the install for a week, was so happy when I finally got it done, then got crushed by this error a minute later. Has anyone actually gotten this working on Windows? With GPU, of course 🤣

Windows 10, CUDA 12.4, Visual Studio.

kot197 avatar Aug 01 '24 19:08 kot197

I had the same problem; for me it was the pandas library that I had imported beforehand. For some reason, if you are doing both imports, import llama_cpp first and pandas second. Hope it helps!
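A minimal illustration of that import order (the model path and the pandas usage are just placeholders):

```python
# Importing llama_cpp before pandas avoided the access violation here;
# the reverse order triggered it.
from llama_cpp import Llama   # llama_cpp first ...
import pandas as pd           # ... pandas second

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # placeholder model file
df = pd.DataFrame({"prompt": ["Hello"]})
print(df)
```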

wiktorwysockig5 avatar Aug 02 '24 07:08 wiktorwysockig5

Any updates here? I have the same issue with CUDA 12.4 and llama-cpp-python version 0.2.85.

MatKollar avatar Aug 05 '24 15:08 MatKollar

The same situation: CUDA 12.5 and llama-cpp-python version 0.2.85. I downgraded numpy to 1.26.4 for dependency reasons, but it doesn't help. Running llama.cpp from cmd works properly, but in Jupyter I get the error:

OSError: exception: access violation reading 0x0000000000000000

UPDATE: The solution for me was based on the following steps.

  1. Reinstall the NVIDIA CUDA Toolkit to a short path (compared to the original path, which has many symbols and spaces).
  2. Use the base cmd for compiling instead of the Developer Command Prompt or Developer PowerShell of VS.

  3. Use the x64 compiler by default (change the VC version in the path string), or use the set command in cmd:

         set path="C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\bin\HostX64\x64";%path%

  4. Downgrade numpy:

         pip install --upgrade --force-reinstall numpy==1.26.4

  5. After that, follow the readme:

         set FORCE_CMAKE=1
         set CMAKE_ARGS=-DGGML_CUDA=ON
         pip install -e .

Now I see the correct status in Jupyter after the model is loaded:

    Device 0: NVIDIA GeForce RTX 4080 Laptop GPU, compute capability 8.9, VMM: yes
    llm_load_tensors: ggml ctx size = 0.27 MiB
    llm_load_tensors: offloading 32 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 33/33 layers to GPU
    llm_load_tensors: CPU buffer size = 532.31 MiB
    llm_load_tensors: CUDA0 buffer size = 7605.34 MiB

tdiz avatar Aug 11 '24 18:08 tdiz

Same issue with Windows 11 and the Vulkan backend. Under WSL2 everything works fine. I got some C4819 warnings when compiling llama.cpp on Windows:

    C:\Users\MrHPX\AppData\Local\Temp\pip-install-ayd48vfc\llama-cpp-python_5bb0eefaaa8d4fb79a7f6137bc50d639\vendor\llama.cpp\src\llama.cpp(21510,1): warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss. [C:\Users\MrHPX\AppData\Local\Temp\tmpijba2clf\build\vendor\llama.cpp\src\llama.vcxproj]
      llama-vocab.cpp
    C:\Users\MrHPX\AppData\Local\Temp\pip-install-ayd48vfc\llama-cpp-python_5bb0eefaaa8d4fb79a7f6137bc50d639\vendor\llama.cpp\src\llama-vocab.cpp(1,1): warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss. [C:\Users\MrHPX\AppData\Local\Temp\tmpijba2clf\build\vendor\llama.cpp\src\llama.vcxproj]

hpx502766238 avatar Oct 04 '24 05:10 hpx502766238

Same error. Is there any resolution for this problem? Thanks.

csaiedu avatar Oct 19 '24 15:10 csaiedu

Same here. This happens regardless of the backend (I just installed via pip install llama-cpp-python). So references to CUDA can be removed from this issue.

VelocityRa avatar Nov 18 '24 03:11 VelocityRa

I had the same problem, for me it was the pandas library that i have imported before. For some reason if you are doing imports, import llama_cpp first and pandas second. Hope it helps!

I have something similar. When I import pandas before importing llama_cpp, it throws this violation error. If I import llama_cpp before, it all works fine. Unfortunately, in my code, I cannot import llama_cpp before :( So I'm still looking for a fix 😃

LucasAubrunHKH avatar Dec 17 '24 13:12 LucasAubrunHKH

Same issue, but only when I run the code in a Jupyter notebook, which probably has a similar cause as the other imports.

ssslakter avatar Dec 17 '24 19:12 ssslakter

I had the same problem, for me it was the pandas library that i have imported before. For some reason if you are doing imports, import llama_cpp first and pandas second. Hope it helps!

Ah, you are a lifesaver!

Rei-Taylor avatar Jan 18 '25 05:01 Rei-Taylor

For me, the issue under Windows 11 seems to have been resolved with the latest updates (no idea which one). Installing via pip with CUDA works fine now.

Create conda environment and install llama_cpp:

    conda create -n llama_clean
    conda activate llama_clean
    conda install pip
    set CMAKE_ARGS=-DGGML_CUDA=on
    set FORCE_CMAKE=1
    cd C:\Users\User\anaconda3\envs\llama_clean
    Scripts\pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Start Python and load the model:

    from llama_cpp import Llama
    model = Llama(model_path="models/generator/Mistral-7B-Instruct-v0.3.Q6_K.gguf", n_ctx=2048, n_gpu_layers=999, embedding=False)

This loads the model correctly into VRAM of the GPU and inference runs perfectly fine with CUDA.

Asgir avatar Jan 18 '25 10:01 Asgir

Which modules do you load before importing llama_cpp? I found a bug: if you load pandas before llama_cpp, you get "OSError: exception: access violation reading 0x0000000000000000".

andretisch avatar Jan 24 '25 11:01 andretisch

Usually nothing, but I tried pandas and it works fine too.

Asgir avatar Jan 24 '25 18:01 Asgir

I get "access violation reading 0x0000000000000000" every time I call llama_backend_init()

  • I removed cuda and installed it to c:\Cuda --> didn't help
  • downgraded llama_cpp_python to 0.3.15 and 0.3.10 --> didn't help
  • downloaded the precompiled library --> didn't help
  • tried with and without the CUDA flag (GGML_CUDA) --> didn't help
  • tried without any other imports --> didn't help

Anything else I can try? I use Windows 11, VS 2022, Python 3.10.9, llama_cpp_python 0.3.16.
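A small check that might narrow this down further, as a sketch (it assumes the compiled libraries live somewhere under the llama_cpp package directory): list which shared libraries actually shipped with the install, then make the smallest possible call into them.

```python
import os
import llama_cpp

print("llama_cpp version:", llama_cpp.__version__)

# List the compiled libraries bundled with the installed package.
pkg_dir = os.path.dirname(llama_cpp.__file__)
for root, _dirs, files in os.walk(pkg_dir):
    for name in files:
        if name.lower().endswith((".dll", ".so", ".dylib")):
            print("shared library:", os.path.join(root, name))

# If this single call already crashes, the fault is in the compiled library
# (or a DLL it depends on), not in model loading or other Python imports.
llama_cpp.llama_backend_init()
print(llama_cpp.llama_print_system_info())
llama_cpp.llama_backend_free()
```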

HoffmannTom avatar Sep 25 '25 14:09 HoffmannTom

Short update: using a precompiled wheel works:

    pip install llama-cpp-python==0.3.2 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

So it probably has something to do with the compiler on Windows or the compiler settings; the resulting binary seems to be broken in my case. However, this is an older pre-built binary version.

HoffmannTom avatar Sep 25 '25 14:09 HoffmannTom

Windows 10
Python 3.10.11
Running in a Python venv
Using llama.cpp locally built with OpenBLAS
llama_cpp_python 0.3.16

Got this: File "C:\MyProject\venv\lib\site-packages\llama_cpp\llama.py", line 2361, in from_pretrained return cls( File "C:\MyProject\venv\lib\site-packages\llama_cpp\llama.py", line 206, in init llama_cpp.llama_backend_init() OSError: exception: access violation reading 0x0000000000000000

OnCodeDeny avatar Oct 19 '25 18:10 OnCodeDeny

> (quoting my previous comment above)

SOLUTION (for me):

Open the "x64 Native Tools Command Prompt for VS" and install llama-cpp-python from there. This provides the right environment for building DLLs. An incorrect environment can cause silent issues at build time that only show up at run time.

OnCodeDeny avatar Nov 05 '25 04:11 OnCodeDeny