cudaErrorIllegalAddress (error 700) due to "an illegal memory access was encountered" on CUDA API call to cudaDeviceSynchronize.

Open aginies opened this issue 1 year ago • 3 comments

Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
version: 4693 (198b1ec6)
built with cc (SUSE Linux) 7.5.0 for x86_64-suse-linux

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

No response

Command line

compute-sanitizer ./llama-bench -v --progress -ngl 30 -m /mnt/data/models/Qwen2.5-Coder-7B-Instruct-IQ3_XS.gguf

Problem description & steps to reproduce

I tried the Docker container and got an illegal memory access as soon as it tried to use the GPU. I then rebuilt llama.cpp from the latest git code and tried again, with the same result. This affects every llama.cpp tool that tries to use the GPU.

OS: openSUSE Leap 15.6
Kernel: 6.12.13-150600.23.25-default / 6.4.0-150600.23.33-default
NVIDIA-SMI 570.86.10, Driver Version: 570.86.10, CUDA Version: 12.8
GPU: NVIDIA GeForce RTX 5090

export CC=/usr/bin/gcc-14
export CXX=/usr/bin/g++-14
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_DISABLE_GRAPHS=ON
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
llama-bench: benchmark 1/2: starting
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) - 31433 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 339 tensors from /mnt/data/models/Qwen2.5-Coder-7B-Instruct-IQ3_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 7B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 22
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /models_out/Qwen2.5-Coder-7B-Instruct...
llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:   28 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_xxs:   98 tensors
llama_model_loader: - type iq3_s:   71 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ3_XS - 3.3 bpw
print_info: file size   = 3.11 GiB (3.51 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = Qwen2.5 Coder 7B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CUDA0
load_tensors: layer   1 assigned to device CUDA0
load_tensors: layer   2 assigned to device CUDA0
load_tensors: layer   3 assigned to device CUDA0
load_tensors: layer   4 assigned to device CUDA0
load_tensors: layer   5 assigned to device CUDA0
load_tensors: layer   6 assigned to device CUDA0
load_tensors: layer   7 assigned to device CUDA0
load_tensors: layer   8 assigned to device CUDA0
load_tensors: layer   9 assigned to device CUDA0
load_tensors: layer  10 assigned to device CUDA0
load_tensors: layer  11 assigned to device CUDA0
load_tensors: layer  12 assigned to device CUDA0
load_tensors: layer  13 assigned to device CUDA0
load_tensors: layer  14 assigned to device CUDA0
load_tensors: layer  15 assigned to device CUDA0
load_tensors: layer  16 assigned to device CUDA0
load_tensors: layer  17 assigned to device CUDA0
load_tensors: layer  18 assigned to device CUDA0
load_tensors: layer  19 assigned to device CUDA0
load_tensors: layer  20 assigned to device CUDA0
load_tensors: layer  21 assigned to device CUDA0
load_tensors: layer  22 assigned to device CUDA0
load_tensors: layer  23 assigned to device CUDA0
load_tensors: layer  24 assigned to device CUDA0
load_tensors: layer  25 assigned to device CUDA0
load_tensors: layer  26 assigned to device CUDA0
load_tensors: layer  27 assigned to device CUDA0
load_tensors: layer  28 assigned to device CUDA0
load_tensors: tensor 'token_embd.weight' (iq3_s) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        CUDA0 model buffer size =  2962.23 MiB
load_tensors:   CPU_Mapped model buffer size =   223.33 MiB
.................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 512
llama_init_from_model: n_ctx_per_seq = 512
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 1: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 2: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 3: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 4: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 5: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 6: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 7: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 8: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 9: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 10: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 11: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 12: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 13: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 14: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 15: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 16: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 17: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 18: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 19: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 20: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 21: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 22: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 23: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 24: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 25: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 26: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 27: n_embd_k_gqa = 512, n_embd_v_gqa = 512
========= Internal Sanitizer Error: The Sanitizer failed to handle a hardware exception.
========= 
========= Program hit cudaErrorIllegalAddress (error 700) due to "an illegal memory access was encountered" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: ggml_backend_cuda_buffer_clear(ggml_backend_buffer*, unsigned char) [0xa6d58] in libggml-cuda.so
=========         Host Frame: llama_kv_cache_init(llama_kv_cache&, llama_model const&, llama_cparams const&, ggml_type, ggml_type, unsigned int, bool) [0xade7d] in libllama.so
=========         Host Frame: llama_init_from_model [0x5650b] in libllama.so
=========         Host Frame: main [0x225f2] in llama-bench
========= 
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_buffer_clear at /home/aginies/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:608
  cudaDeviceSynchronize()
/home/aginies/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:73: CUDA error

First Bad Commit

No response

Relevant log output


aginies avatar Feb 12 '25 15:02 aginies

cuda-gdb output:

aginies@ryzen9:~/llama.cpp/build/bin> cuda-gdb --args ./llama-bench -ngl 30 -m /mnt/data/models/Qwen2.5-Coder-7B-Instruct-IQ3_XS.gguf 
NVIDIA (R) cuda-gdb 12.8
...
Reading symbols from ./llama-bench...
(cuda-gdb) run
Starting program: /home/aginies/llama.cpp/build/bin/llama-bench -ngl 30 -m /mnt/data/models/Qwen2.5-Coder-7B-Instruct-IQ3_XS.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffb03ff000 (LWP 10587)]
[New Thread 0x7fffaefff000 (LWP 10588)]
[Detaching after fork from child process 10589]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 5090)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 9 5950X 16-Core Processor)
warning: asserts enabled, performance may be affected
warning: debug build, performance may be affected
load_backend: failed to find ggml_backend_init in ./libggml-cuda.so
load_backend: failed to find ggml_backend_init in ./libggml-cpu.so
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
[New Thread 0x7fffad383000 (LWP 10610)]
[New Thread 0x7fffacb82000 (LWP 10611)]

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffd9974e460  memset32

Thread 1 "llama-bench" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (97,0,0), thread (0,0,0), device 0, sm 79, warp 0, lane 0]
0x00007ffd9974e490 in memset32<<<(3584,1,1),(512,1,1)>>> ()
(cuda-gdb) continue
Continuing.
[Thread 0x7fffacb82000 (LWP 10611) exited]
[Thread 0x7fffad383000 (LWP 10610) exited]
[Thread 0x7fffaefff000 (LWP 10588) exited]
[Thread 0x7ffff7edb000 (LWP 10583) exited]
[Thread 0x7fffb03ff000 (LWP 10587) exited]
[New process 10583]

Program terminated with signal SIGKILL, Killed.
The program no longer exists.

aginies avatar Feb 12 '25 18:02 aginies
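
The launch configuration in the cuda-gdb output can be sanity-checked against the KV cache dimensions printed earlier in the log. The sketch below is arithmetic only, and it rests on two assumptions that the thread does not confirm: that the K and V tensors of all 28 layers live in a single CUDA buffer, and that the faulting `memset32` kernel is the one clearing that buffer. Under those assumptions the buffer is 28 MiB, and dividing it by the launched threads gives a whole number of bytes per thread, so the grid size looks consistent with an ordinary full-buffer clear rather than an oversized one.

```python
# Values copied from the llama_kv_cache_init lines in the log.
kv_size = 512          # kv_size = 512
n_embd_k_gqa = 512     # per layer
n_embd_v_gqa = 512     # per layer
n_layer = 28
bytes_per_elem = 2     # type_k = type_v = 'f16'

# Assumption: K and V for every layer share one CUDA buffer.
kv_bytes = n_layer * kv_size * (n_embd_k_gqa + n_embd_v_gqa) * bytes_per_elem

# Launch configuration reported by cuda-gdb:
#   memset32<<<(3584,1,1),(512,1,1)>>>
grid_x, block_x = 3584, 512
threads = grid_x * block_x

print(f"KV buffer: {kv_bytes} bytes = {kv_bytes / 2**20:.2f} MiB")
print(f"threads launched: {threads}")
print(f"bytes per thread: {kv_bytes / threads}")
```

This does not explain the fault. It only suggests that the size passed to the clear is plausible, which fits the host backtrace ending in `ggml_backend_cuda_buffer_clear`.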

If the call to cudaMemset here was wrong, it would crash on every GPU. This is probably a driver bug that only affects this GPU.

slaren avatar Feb 12 '25 18:02 slaren

> If the call to cudaMemset here was wrong, it would crash on every GPU. This is probably a driver bug that only affects this GPU.

Thanks for the comment. It sounds like I need to wait until NVIDIA releases a new driver. I ran a test under Windows 10 and did not hit the issue there. The driver version on Windows is 572 instead of 570, so there is probably a bug in 570.

aginies avatar Feb 13 '25 07:02 aginies
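
Because the working and failing setups are distinguished here by driver version, it helps to record that version precisely in each report. A minimal sketch, assuming the `Driver Version: X.Y.Z` wording of the `nvidia-smi` banner quoted in this thread; the `driver_version` helper is hypothetical and not part of any NVIDIA or llama.cpp tooling.

```python
import re

def driver_version(banner: str):
    """Extract the driver version from nvidia-smi banner text as a tuple of ints."""
    m = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", banner)
    if m is None:
        return None
    return tuple(int(part) for part in m.group(1).split("."))

# Banner text as quoted in the original report.
linux_banner = "NVIDIA-SMI 570.86.10 Driver Version: 570.86.10 CUDA Version: 12.8"

version = driver_version(linux_banner)
print(version)

# Tuples compare element by element, so checking the major version against
# the 572 series mentioned for Windows is a direct comparison.
print(version[0] < 572)
```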

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Mar 30 '25 01:03 github-actions[bot]

The issue also occurs on the RTX PRO 6000 Blackwell:

ollama._types.ResponseError: an error was encountered while running the model: CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2667
  cudaStreamSynchronize(cuda_ctx->stream())
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: CUDA error (status code: -1)
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:15:00.0 Off |                  Off |
| 30%   34C    P8              2W /  300W |   82306MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:95:00.0 Off |                  Off |
| 30%   43C    P8             19W /  300W |       3MiB /  97887MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

tjwebb avatar Oct 19 '25 05:10 tjwebb
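
The reports in this thread share an error string but not a call site, which matters when deciding whether they describe one bug. As a rough triage aid, the sketch below pulls the function, source line, and failing CUDA call out of the `CUDA error:` block that ggml prints. The regular expression and the `parse_cuda_error` helper are illustrative assumptions based only on the two excerpts in this thread, not on a documented log format.

```python
import re

# Pattern inferred from the two excerpts in this thread (an assumption, not a
# stable format): a "current device" line followed by the failing call.
PATTERN = re.compile(
    r"current device: (?P<device>\d+), in function (?P<function>\w+) "
    r"at (?P<file>\S+):(?P<line>\d+)\s*\n\s*(?P<call>\S.*)"
)

def parse_cuda_error(log: str):
    """Return a dict describing the failing call, or None if nothing matches."""
    m = PATTERN.search(log)
    if m is None:
        return None
    info = m.groupdict()
    info["device"] = int(info["device"])
    info["line"] = int(info["line"])
    info["call"] = info["call"].strip()
    return info

first_report = """CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_buffer_clear at /home/aginies/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:608
  cudaDeviceSynchronize()
"""

later_report = """CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2667
  cudaStreamSynchronize(cuda_ctx->stream())
"""

for name, log in [("first", first_report), ("later", later_report)]:
    info = parse_cuda_error(log)
    print(name, info["function"], info["line"], info["call"])
```

On these excerpts, the first report fails in `ggml_backend_cuda_buffer_clear` during context creation, while the later one fails in `ggml_backend_cuda_synchronize` while running the model, so the later reports may or may not share the original root cause.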

+1, also facing this on an RTX 2000 Ada and an RTX 5070 Ti.

Sun Nov 23 18:43:58 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   47C    P1             41W /  300W |     319MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 2000 Ada Gene...    Off |   00000000:03:00.0 Off |                  Off |
| 30%   48C    P0             17W /   70W |       5MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2629      G   ...rack-uuid=3190708988185955192        291MiB |
|    0   N/A  N/A           38589      G   resources                                 3MiB |
+-----------------------------------------------------------------------------------------+
❯ ollama -v
ollama version is 0.13.0

ovflowd avatar Nov 23 '25 17:11 ovflowd