Regression. Unable to run any model. CRASH!!!
Name and Version
llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: KHR_coopmat
version: 4778 (a82c9e7c)
built with clang version 18.1.1 for x86_64-unknown-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
RX 7600
Models
agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
Problem description & steps to reproduce
gdb llama-server
GNU gdb (GDB) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from llama-server...
(No debugging symbols found in llama-server)
(gdb) set args -t 1 --ctx-size 0 --no-kv-offload --port 8999 --n-predict 2048 --gpu-layers 128 -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
(gdb) run
Starting program: /usr/local/bin/llama-server -t 1 --ctx-size 0 --no-kv-offload --port 8999 --n-predict 2048 --gpu-layers 128 -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffed16b700 (LWP 31050)]
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: KHR_coopmat
[New Thread 0x7fffec96a700 (LWP 31051)]
build: 4778 (a82c9e7c) with clang version 18.1.1 for x86_64-unknown-linux-gnu
system info: n_threads = 1, n_threads_batch = 1, total_threads = 8
system_info: n_threads = 1 (n_threads_batch = 1) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
[New Thread 0x7fffe7fff700 (LWP 31052)]
[New Thread 0x7fffe77fe700 (LWP 31053)]
[New Thread 0x7fffe6ffd700 (LWP 31054)]
[New Thread 0x7fffe67fc700 (LWP 31055)]
[New Thread 0x7fffe5ffb700 (LWP 31056)]
[New Thread 0x7fffe57fa700 (LWP 31057)]
[New Thread 0x7fffe4ff9700 (LWP 31058)]
[New Thread 0x7fffdbfff700 (LWP 31059)]
main: HTTP server is listening, hostname: 127.0.0.1, port: 8999, http threads: 7
main: loading model
srv load_model: loading model './LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 7600 (RADV NAVI33)) - 7936 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 339 tensors from ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepScaleR 1.5B Preview
llama_model_loader: - kv 3: general.organization str = Agentica Org
llama_model_loader: - kv 4: general.finetune str = Preview
llama_model_loader: - kv 5: general.basename str = DeepScaleR
llama_model_loader: - kv 6: general.size_label str = 1.5B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv 10: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 12: general.dataset.count u32 = 4
llama_model_loader: - kv 13: general.dataset.0.name str = NuminaMath CoT
llama_model_loader: - kv 14: general.dataset.0.organization str = AI MO
llama_model_loader: - kv 15: general.dataset.0.repo_url str = https://huggingface.co/AI-MO/NuminaMa...
llama_model_loader: - kv 16: general.dataset.1.name str = Omni MATH
llama_model_loader: - kv 17: general.dataset.1.organization str = KbsdJames
llama_model_loader: - kv 18: general.dataset.1.repo_url str = https://huggingface.co/KbsdJames/Omni...
llama_model_loader: - kv 19: general.dataset.2.name str = STILL 3 Preview RL Data
llama_model_loader: - kv 20: general.dataset.2.organization str = RUC AIBOX
llama_model_loader: - kv 21: general.dataset.2.repo_url str = https://huggingface.co/RUC-AIBOX/STIL...
llama_model_loader: - kv 22: general.dataset.3.name str = Competition_Math
llama_model_loader: - kv 23: general.dataset.3.organization str = Hendrycks
llama_model_loader: - kv 24: general.dataset.3.repo_url str = https://huggingface.co/hendrycks/comp...
llama_model_loader: - kv 25: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 26: qwen2.block_count u32 = 28
llama_model_loader: - kv 27: qwen2.context_length u32 = 131072
llama_model_loader: - kv 28: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 29: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 30: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 31: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 32: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 33: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: general.file_type u32 = 7
llama_model_loader: - kv 47: quantize.imatrix.file str = /models_out/DeepScaleR-1.5B-Preview-G...
llama_model_loader: - kv 48: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 49: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 50: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 1.76 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 1536
print_info: n_layer = 28
print_info: n_head = 12
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 8960
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1.5B
print_info: model params = 1.78 B
print_info: general.name = DeepScaleR 1.5B Preview
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151643 '<|end▁of▁sentence|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[New Thread 0x7fffdb35a700 (LWP 31060)]
[New Thread 0x7fffd2b59700 (LWP 31061)]
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 236.47 MiB
load_tensors: Vulkan0 model buffer size = 1564.62 MiB
............................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 131072
llama_init_from_model: n_ctx_per_seq = 131072
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: kv_size = 131072, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 3584.00 MiB
llama_init_from_model: KV self size = 3584.00 MiB, K (f16): 1792.00 MiB, V (f16): 1792.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
llama_init_from_model: Vulkan0 compute buffer size = 302.75 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 3332.01 MiB
llama_init_from_model: graph nodes = 986
llama_init_from_model: graph splits = 58
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[New Thread 0x7fffd37fe700 (LWP 31073)]
[New Thread 0x7fffd2358700 (LWP 31075)]
[New Thread 0x7fffd1b57700 (LWP 31076)]
[Thread 0x7fffd2358700 (LWP 31075) exited]
[Thread 0x7fffd37fe700 (LWP 31073) exited]
[New Thread 0x7fffd1356700 (LWP 31077)]
[New Thread 0x7fffd0b55700 (LWP 31078)]
[Thread 0x7fffd1b57700 (LWP 31076) exited]
[Thread 0x7fffd1356700 (LWP 31077) exited]
[Thread 0x7fffd0b55700 (LWP 31078) exited]
terminate called after throwing an instance of 'std::out_of_range'
what(): unordered_map::at
Thread 1 "llama-server" received signal SIGABRT, Aborted.
0x00007ffff5471e35 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x00007ffff5471e35 in raise () from /lib64/libc.so.6
#1 0x00007ffff545c895 in abort () from /lib64/libc.so.6
#2 0x00007ffff56a2bf9 in __gnu_cxx::__verbose_terminate_handler ()
at ../../../../gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007ffff56ae26a in __cxxabiv1::__terminate (handler=
First Bad Commit
- I think the last version that worked was 4743. I can't run any models, which I was able to do with 4743 and below.
Relevant log output
terminate called after throwing an instance of 'std::out_of_range'
what(): unordered_map::at
Please try a clean build folder, and also check for any stale ggml-vulkan-shaders.* files in the source tree.
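Something along these lines, run from the repository root, should show whether any generated shader sources are lying around outside the build directory (the pattern is only a guess at what to look for):

find . -name 'ggml-vulkan-shaders.*' -not -path './build/*'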
My RPM spec file for building. It uses a clean folder every time.
%global __python %{__python3}
%define git_tag b4778
Name: llama-cpp
Version: 1.0.0
Release: %{git_tag}
Summary: local LLM inference engine
License: MIT
%description
llama-cpp is an LLM inference engine.
# Set the source directory
%define source_dir /builddir/llama/
%prep
cd %{source_dir}
git checkout %{git_tag}
rm -rf build; mkdir build
cd build
export PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig:/usr/local/share/pkgconfig
export CFLAGS=' -O3 -fno-omit-frame-pointer'
export CXXFLAGS=$CFLAGS
CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
-DGGML_BLAS:BOOL=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_CCACHE=OFF \
-DGGML_VULKAN:BOOL=ON \
-DGGML_VULKAN_RUN_TESTS=ON \
-DBUILD_SHARED_LIBS:BOOL=OFF \
-DCMAKE_BUILD_TYPE=Release ..
%build
cd %{source_dir}/build
nice make -j 64 -l 9
%install
rm -rf %{buildroot}
mkdir -p %{buildroot}
cd %{source_dir}/build
make DESTDIR=%{buildroot} install
%files
/*
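For reference, I build the resulting package with roughly the following (the spec file name is just what I call it locally):

rpmbuild -bb llama-cpp.spec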
Did you delete any stale shader files in the source tree? I was wondering if it was an issue like https://github.com/ggml-org/llama.cpp/issues/11788#issuecomment-2661877838, but somehow manifesting as a runtime shader compile failure rather than a build-time failure.
If that's not it, can you try to bisect to a commit?
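For reference, a bisect between the last known good build and this one would look roughly like the following (tag names assumed from your report; rebuild and try to load a model at each step):

git bisect start
git bisect bad b4778
git bisect good b4743
# after each rebuild and test, mark the result:
git bisect good    # or: git bisect bad
git bisect reset   # when finished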
I added a git clean -fdx to my build script. This is how it looks before the build.
find . -iname '*shaders*'
./ggml/src/ggml-kompute/kompute-shaders
./ggml/src/ggml-vulkan/vulkan-shaders
./ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
After the build
find . -iname '*shaders*'
./ggml/src/ggml-kompute/kompute-shaders
./ggml/src/ggml-vulkan/vulkan-shaders
./ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
./build/bin/vulkan-shaders-gen
./build/ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/ggml-vulkan-shaders.cpp.o
./build/ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/ggml-vulkan-shaders.cpp.o.d
./build/ggml/src/ggml-vulkan/ggml-vulkan-shaders.hpp
./build/ggml/src/ggml-vulkan/vulkan-shaders.spv
./build/ggml/src/ggml-vulkan/ggml-vulkan-shaders.cpp
./build/ggml/src/ggml-vulkan/vulkan-shaders
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o.d
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o
Even after rebuilding, the failure is the same. I will bisect when I get the time, or create a debug build and find the exact line number where it fails.
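If I go the debug-build route, I expect changing the build type in my spec (and dropping -O3 from CFLAGS) should be enough for gdb to show line numbers; something like this, otherwise the same configure as above:

CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
-DGGML_VULKAN:BOOL=ON \
-DCMAKE_BUILD_TYPE=Debug ..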
The out-of-range error happens much later, after a failed shader compile. There should have been a message on stderr about which shader failed to compile. I think bisecting will be helpful, but there's a good chance this is a driver bug of some sort.
There are no other messages.
load_tensors: CPU_Mapped model buffer size = 236.47 MiB
load_tensors: Vulkan0 model buffer size = 1564.62 MiB
............................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 112.00 MiB
llama_init_from_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
llama_init_from_model: Vulkan0 compute buffer size = 299.75 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 11.01 MiB
llama_init_from_model: graph nodes = 986
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
terminate called after throwing an instance of 'std::out_of_range'
what(): unordered_map::at
BTW, I am using the Vulkan drivers from Mesa.
Is there any way I can dump the shader compilation logs?
It's the broken tests: you built with -DGGML_VULKAN_RUN_TESTS=ON, which is currently broken. But regardless of that, it should never be enabled on a build you want to use; it just runs a number of internal unit tests for development purposes.
This is not the first case of someone enabling this, so I guess the confusion is that it looks like a post-build validation test run?
Yes, that was it. I had enabled it thinking it was a post-build validation test, and I turned it on recently because of a few Vulkan-related crashes at runtime.
I think we should disable this option until it is stable.
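For reference, the working configure on my side is the same cmake command as in the spec above with that one line removed (the option defaults to OFF as far as I can tell):

CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
-DGGML_BLAS:BOOL=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_CCACHE=OFF \
-DGGML_VULKAN:BOOL=ON \
-DBUILD_SHARED_LIBS:BOOL=OFF \
-DCMAKE_BUILD_TYPE=Release ..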
I already fixed it in a feature I'm working on, but it's not yet ready to be merged. Even if it had not crashed, though, your program would have just run the tests instead of whatever you wanted it to do.