Regression. Unable to run any model. CRASH!!!
Name and Version
llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: KHR_coopmat
version: 4778 (a82c9e7c)
built with clang version 18.1.1 for x86_64-unknown-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
RX 7600
Models
agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
Problem description & steps to reproduce
gdb llama-server
GNU gdb (GDB) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from llama-server...
(No debugging symbols found in llama-server)
(gdb) set args -t 1 --ctx-size 0 --no-kv-offload --port 8999 --n-predict 2048 --gpu-layers 128 -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
(gdb) run
Starting program: /usr/local/bin/llama-server -t 1 --ctx-size 0 --no-kv-offload --port 8999 --n-predict 2048 --gpu-layers 128 -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffed16b700 (LWP 31050)]
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: KHR_coopmat
[New Thread 0x7fffec96a700 (LWP 31051)]
build: 4778 (a82c9e7c) with clang version 18.1.1 for x86_64-unknown-linux-gnu
system info: n_threads = 1, n_threads_batch = 1, total_threads = 8
system_info: n_threads = 1 (n_threads_batch = 1) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
[New Thread 0x7fffe7fff700 (LWP 31052)]
[New Thread 0x7fffe77fe700 (LWP 31053)]
[New Thread 0x7fffe6ffd700 (LWP 31054)]
[New Thread 0x7fffe67fc700 (LWP 31055)]
[New Thread 0x7fffe5ffb700 (LWP 31056)]
[New Thread 0x7fffe57fa700 (LWP 31057)]
[New Thread 0x7fffe4ff9700 (LWP 31058)]
[New Thread 0x7fffdbfff700 (LWP 31059)]
main: HTTP server is listening, hostname: 127.0.0.1, port: 8999, http threads: 7
main: loading model
srv load_model: loading model './LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 7600 (RADV NAVI33)) - 7936 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 339 tensors from ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepScaleR 1.5B Preview
llama_model_loader: - kv 3: general.organization str = Agentica Org
llama_model_loader: - kv 4: general.finetune str = Preview
llama_model_loader: - kv 5: general.basename str = DeepScaleR
llama_model_loader: - kv 6: general.size_label str = 1.5B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv 10: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 12: general.dataset.count u32 = 4
llama_model_loader: - kv 13: general.dataset.0.name str = NuminaMath CoT
llama_model_loader: - kv 14: general.dataset.0.organization str = AI MO
llama_model_loader: - kv 15: general.dataset.0.repo_url str = https://huggingface.co/AI-MO/NuminaMa...
llama_model_loader: - kv 16: general.dataset.1.name str = Omni MATH
llama_model_loader: - kv 17: general.dataset.1.organization str = KbsdJames
llama_model_loader: - kv 18: general.dataset.1.repo_url str = https://huggingface.co/KbsdJames/Omni...
llama_model_loader: - kv 19: general.dataset.2.name str = STILL 3 Preview RL Data
llama_model_loader: - kv 20: general.dataset.2.organization str = RUC AIBOX
llama_model_loader: - kv 21: general.dataset.2.repo_url str = https://huggingface.co/RUC-AIBOX/STIL...
llama_model_loader: - kv 22: general.dataset.3.name str = Competition_Math
llama_model_loader: - kv 23: general.dataset.3.organization str = Hendrycks
llama_model_loader: - kv 24: general.dataset.3.repo_url str = https://huggingface.co/hendrycks/comp...
llama_model_loader: - kv 25: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 26: qwen2.block_count u32 = 28
llama_model_loader: - kv 27: qwen2.context_length u32 = 131072
llama_model_loader: - kv 28: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 29: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 30: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 31: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 32: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 33: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: general.file_type u32 = 7
llama_model_loader: - kv 47: quantize.imatrix.file str = /models_out/DeepScaleR-1.5B-Preview-G...
llama_model_loader: - kv 48: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 49: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 50: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 1.76 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 1536
print_info: n_layer = 28
print_info: n_head = 12
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 8960
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1.5B
print_info: model params = 1.78 B
print_info: general.name = DeepScaleR 1.5B Preview
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151643 '<|end▁of▁sentence|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[New Thread 0x7fffdb35a700 (LWP 31060)]
[New Thread 0x7fffd2b59700 (LWP 31061)]
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 236.47 MiB
load_tensors: Vulkan0 model buffer size = 1564.62 MiB
............................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 131072
llama_init_from_model: n_ctx_per_seq = 131072
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: kv_size = 131072, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 3584.00 MiB
llama_init_from_model: KV self size = 3584.00 MiB, K (f16): 1792.00 MiB, V (f16): 1792.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
llama_init_from_model: Vulkan0 compute buffer size = 302.75 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 3332.01 MiB
llama_init_from_model: graph nodes = 986
llama_init_from_model: graph splits = 58
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[New Thread 0x7fffd37fe700 (LWP 31073)]
[New Thread 0x7fffd2358700 (LWP 31075)]
[New Thread 0x7fffd1b57700 (LWP 31076)]
[Thread 0x7fffd2358700 (LWP 31075) exited]
[Thread 0x7fffd37fe700 (LWP 31073) exited]
[New Thread 0x7fffd1356700 (LWP 31077)]
[New Thread 0x7fffd0b55700 (LWP 31078)]
[Thread 0x7fffd1b57700 (LWP 31076) exited]
[Thread 0x7fffd1356700 (LWP 31077) exited]
[Thread 0x7fffd0b55700 (LWP 31078) exited]
terminate called after throwing an instance of 'std::out_of_range'
what(): unordered_map::at
Thread 1 "llama-server" received signal SIGABRT, Aborted.
0x00007ffff5471e35 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x00007ffff5471e35 in raise () from /lib64/libc.so.6
#1 0x00007ffff545c895 in abort () from /lib64/libc.so.6
#2 0x00007ffff56a2bf9 in __gnu_cxx::__verbose_terminate_handler ()
at ../../../../gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007ffff56ae26a in __cxxabiv1::__terminate (handler=
First Bad Commit
- I think the last version that worked was 4743. I can't run any models, which I was able to do with 4743 and below.
Relevant log output
terminate called after throwing an instance of 'std::out_of_range'
what(): unordered_map::at
Please try a clean build folder, and also check for any stale ggml-vulkan-shaders.* files in the source tree.
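Something along these lines, run from the repository root, should show whether any generated shader sources are lying around outside the build directory (the pattern is only a guess at what to look for):

find . -name 'ggml-vulkan-shaders.*' -not -path './build/*'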
My RPM spec file for building. It uses a clean folder every time.
%global __python %{__python3}
%define git_tag b4778
Name: llama-cpp
Version: 1.0.0
Release: %{git_tag}
Summary: local LLM inference engine
License: MIT
%description
llama-cpp is an LLM inference engine.
# Set the source directory
%define source_dir /builddir/llama/
%prep
cd %{source_dir}
git checkout %{git_tag}
rm -rf build; mkdir build
cd build
export PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig:/usr/local/share/pkgconfig
export CFLAGS=' -O3 -fno-omit-frame-pointer'
export CXXFLAGS=$CFLAGS
CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
-DGGML_BLAS:BOOL=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_CCACHE=OFF \
-DGGML_VULKAN:BOOL=ON \
-DGGML_VULKAN_RUN_TESTS=ON \
-DBUILD_SHARED_LIBS:BOOL=OFF \
-DCMAKE_BUILD_TYPE=Release ..
%build
cd %{source_dir}/build
nice make -j 64 -l 9
%install
rm -rf %{buildroot}
mkdir -p %{buildroot}
cd %{source_dir}/build
make DESTDIR=%{buildroot} install
%files
/*
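For reference, I build the resulting package with roughly the following (the spec file name is just what I call it locally):

rpmbuild -bb llama-cpp.spec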
Did you delete any stale shader files in the source tree? I was wondering if it was an issue like https://github.com/ggml-org/llama.cpp/issues/11788#issuecomment-2661877838, but somehow manifesting as a runtime shader compile failure rather than a build-time failure.
If that's not it, can you try to bisect to a commit?
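For reference, a bisect between the last known good build and this one would look roughly like the following (tag names assumed from your report; rebuild and try to load a model at each step):

git bisect start
git bisect bad b4778
git bisect good b4743
# after each rebuild and test, mark the result:
git bisect good    # or: git bisect bad
git bisect reset   # when finished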
I added a git clean -fdx to my build script. This is how it looks before the build.
find . -iname '*shaders*'
./ggml/src/ggml-kompute/kompute-shaders
./ggml/src/ggml-vulkan/vulkan-shaders
./ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
After the build
find . -iname '*shaders*'
./ggml/src/ggml-kompute/kompute-shaders
./ggml/src/ggml-vulkan/vulkan-shaders
./ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
./build/bin/vulkan-shaders-gen
./build/ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/ggml-vulkan-shaders.cpp.o
./build/ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/ggml-vulkan-shaders.cpp.o.d
./build/ggml/src/ggml-vulkan/ggml-vulkan-shaders.hpp
./build/ggml/src/ggml-vulkan/vulkan-shaders.spv
./build/ggml/src/ggml-vulkan/ggml-vulkan-shaders.cpp
./build/ggml/src/ggml-vulkan/vulkan-shaders
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o.d
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o
Even after rebuilding, the failure is the same. I will bisect when I get the time, or create a debug build and find the exact line number where it fails.
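If I go the debug-build route, I expect changing the build type in my spec (and dropping -O3 from CFLAGS) should be enough for gdb to show line numbers; something like this, otherwise the same configure as above:

CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
-DGGML_VULKAN:BOOL=ON \
-DCMAKE_BUILD_TYPE=Debug ..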
The out-of-range error happens much later, after a failed shader compile. There should have been a message on stderr about which shader failed to compile. I think bisecting will be helpful, but there's a good chance this is a driver bug of some sort.
There are no other messages.
load_tensors: CPU_Mapped model buffer size = 236.47 MiB
load_tensors: Vulkan0 model buffer size = 1564.62 MiB
............................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 112.00 MiB
llama_init_from_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
llama_init_from_model: Vulkan0 compute buffer size = 299.75 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 11.01 MiB
llama_init_from_model: graph nodes = 986
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
terminate called after throwing an instance of 'std::out_of_range'
what(): unordered_map::at
BTW, I am using the Vulkan drivers from Mesa.
Is there any way I can dump the shader compilation logs?
It's the broken tests: you built with -DGGML_VULKAN_RUN_TESTS=ON, which is currently broken. But regardless of that, it should never be enabled on a build you want to use; it just runs a number of internal unit tests for development purposes.
This is not the first case of someone enabling this, so I guess the confusion is that it looks like a post-build validation test run?
Yes, that was it. I had enabled it thinking it was a post-build validation test, and I turned it on recently because of a few Vulkan-related crashes at runtime.
I think we should disable this option until it is stable.
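For reference, the working configure on my side is the same cmake command as in the spec above with that one line removed (the option defaults to OFF as far as I can tell):

CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
-DGGML_BLAS:BOOL=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_CCACHE=OFF \
-DGGML_VULKAN:BOOL=ON \
-DBUILD_SHARED_LIBS:BOOL=OFF \
-DCMAKE_BUILD_TYPE=Release ..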
I already fixed it in a feature I'm working on, but it's not yet ready to be merged. Even if it had not crashed, though, your program would have just run the tests instead of whatever you wanted it to do.