
Regression. Unable to run any model. CRASH!!!

Open acbits opened this issue 10 months ago • 7 comments

Name and Version

llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: KHR_coopmat
version: 4778 (a82c9e7c)
built with clang version 18.1.1 for x86_64-unknown-linux-gnu

Operating systems

Linux

GGML backends

Vulkan

Hardware

RX 7600

Models

agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf

Problem description & steps to reproduce

gdb llama-server
GNU gdb (GDB) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from llama-server...
(No debugging symbols found in llama-server)
(gdb) set args -t 1 --ctx-size 0 --no-kv-offload --port 8999 --n-predict 2048 --gpu-layers 128 -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
(gdb) run
Starting program: /usr/local/bin/llama-server -t 1 --ctx-size 0 --no-kv-offload --port 8999 --n-predict 2048 --gpu-layers 128 -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffed16b700 (LWP 31050)]
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: KHR_coopmat
[New Thread 0x7fffec96a700 (LWP 31051)]
build: 4778 (a82c9e7c) with clang version 18.1.1 for x86_64-unknown-linux-gnu
system info: n_threads = 1, n_threads_batch = 1, total_threads = 8

system_info: n_threads = 1 (n_threads_batch = 1) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

[New Thread 0x7fffe7fff700 (LWP 31052)]
[New Thread 0x7fffe77fe700 (LWP 31053)]
[New Thread 0x7fffe6ffd700 (LWP 31054)]
[New Thread 0x7fffe67fc700 (LWP 31055)]
[New Thread 0x7fffe5ffb700 (LWP 31056)]
[New Thread 0x7fffe57fa700 (LWP 31057)]
[New Thread 0x7fffe4ff9700 (LWP 31058)]
[New Thread 0x7fffdbfff700 (LWP 31059)]
main: HTTP server is listening, hostname: 127.0.0.1, port: 8999, http threads: 7
main: loading model
srv load_model: loading model './LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 7600 (RADV NAVI33)) - 7936 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 339 tensors from ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepScaleR 1.5B Preview
llama_model_loader: - kv 3: general.organization str = Agentica Org
llama_model_loader: - kv 4: general.finetune str = Preview
llama_model_loader: - kv 5: general.basename str = DeepScaleR
llama_model_loader: - kv 6: general.size_label str = 1.5B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv 10: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 12: general.dataset.count u32 = 4
llama_model_loader: - kv 13: general.dataset.0.name str = NuminaMath CoT
llama_model_loader: - kv 14: general.dataset.0.organization str = AI MO
llama_model_loader: - kv 15: general.dataset.0.repo_url str = https://huggingface.co/AI-MO/NuminaMa...
llama_model_loader: - kv 16: general.dataset.1.name str = Omni MATH
llama_model_loader: - kv 17: general.dataset.1.organization str = KbsdJames
llama_model_loader: - kv 18: general.dataset.1.repo_url str = https://huggingface.co/KbsdJames/Omni...
llama_model_loader: - kv 19: general.dataset.2.name str = STILL 3 Preview RL Data
llama_model_loader: - kv 20: general.dataset.2.organization str = RUC AIBOX
llama_model_loader: - kv 21: general.dataset.2.repo_url str = https://huggingface.co/RUC-AIBOX/STIL...
llama_model_loader: - kv 22: general.dataset.3.name str = Competition_Math
llama_model_loader: - kv 23: general.dataset.3.organization str = Hendrycks
llama_model_loader: - kv 24: general.dataset.3.repo_url str = https://huggingface.co/hendrycks/comp...
llama_model_loader: - kv 25: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 26: qwen2.block_count u32 = 28
llama_model_loader: - kv 27: qwen2.context_length u32 = 131072
llama_model_loader: - kv 28: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 29: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 30: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 31: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 32: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 33: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: general.file_type u32 = 7
llama_model_loader: - kv 47: quantize.imatrix.file str = /models_out/DeepScaleR-1.5B-Preview-G...
llama_model_loader: - kv 48: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 49: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 50: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 1.76 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 1536
print_info: n_layer = 28
print_info: n_head = 12
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 8960
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1.5B
print_info: model params = 1.78 B
print_info: general.name = DeepScaleR 1.5B Preview
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151643 '<|end▁of▁sentence|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[New Thread 0x7fffdb35a700 (LWP 31060)]
[New Thread 0x7fffd2b59700 (LWP 31061)]
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 236.47 MiB
load_tensors: Vulkan0 model buffer size = 1564.62 MiB
............................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 131072
llama_init_from_model: n_ctx_per_seq = 131072
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: kv_size = 131072, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 3584.00 MiB
llama_init_from_model: KV self size = 3584.00 MiB, K (f16): 1792.00 MiB, V (f16): 1792.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
llama_init_from_model: Vulkan0 compute buffer size = 302.75 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 3332.01 MiB
llama_init_from_model: graph nodes = 986
llama_init_from_model: graph splits = 58
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[New Thread 0x7fffd37fe700 (LWP 31073)]
[New Thread 0x7fffd2358700 (LWP 31075)]
[New Thread 0x7fffd1b57700 (LWP 31076)]
[Thread 0x7fffd2358700 (LWP 31075) exited]
[Thread 0x7fffd37fe700 (LWP 31073) exited]
[New Thread 0x7fffd1356700 (LWP 31077)]
[New Thread 0x7fffd0b55700 (LWP 31078)]
[Thread 0x7fffd1b57700 (LWP 31076) exited]
[Thread 0x7fffd1356700 (LWP 31077) exited]
[Thread 0x7fffd0b55700 (LWP 31078) exited]
terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at

Thread 1 "llama-server" received signal SIGABRT, Aborted.
0x00007ffff5471e35 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x00007ffff5471e35 in raise () from /lib64/libc.so.6
#1  0x00007ffff545c895 in abort () from /lib64/libc.so.6
#2  0x00007ffff56a2bf9 in __gnu_cxx::__verbose_terminate_handler () at ../../../../gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffff56ae26a in __cxxabiv1::__terminate (handler=) at ../../../../gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007ffff56ae2d5 in std::terminate () at ../../../../gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00007ffff56ae527 in __cxxabiv1::__cxa_throw (obj=, tinfo=0x7ffff58141c8 , dest=0x7ffff56c3440 std::out_of_range::~out_of_range()) at ../../../../gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
#6  0x00007ffff56a5500 in std::__throw_out_of_range (__s=0x5555559b819a "unordered_map::at") at ../../../../../gcc/libstdc++-v3/src/c++11/functexcept.cc:86
#7  0x00005555559194e7 in ggml_pipeline_allocate_descriptor_sets(std::shared_ptr<vk_device_struct>&) ()
#8  0x00005555559398d9 in void ggml_vk_test_matmul<unsigned short, float>(ggml_backend_vk_context*, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, int, int) ()
#9  0x0000555555914ff4 in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) ()
#10 0x0000555555965417 in ggml_backend_sched_graph_compute_async ()
#11 0x000055555576fd14 in llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_threadpool*) ()
#12 0x000055555576c177 in llama_decode ()
#13 0x0000555555744d82 in common_init_from_params(common_params&) ()
#14 0x00005555555b6d05 in server_context::load_model(common_params const&) ()
#15 0x0000555555584b18 in main ()
(gdb)

First Bad Commit

I think the last version that worked was 4743. I can't run any model now, which I was able to do with 4743 and below.

Relevant log output

terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at

acbits • Feb 25 '25 22:02

Please try a clean build folder, and also check for any stale ggml-vulkan-shaders.* files in the source tree.
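Something along these lines should be enough to verify there are no stale generated files (a rough sketch; note that git clean -xfd deletes every untracked file in the checkout, so stash anything you want to keep first):

cd /path/to/llama.cpp                  # your source checkout
git clean -xfd .                       # remove all untracked/generated files, including old build dirs
find . -name 'ggml-vulkan-shaders.*'   # should print nothing before the next configure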

jeffbolznv • Feb 25 '25 22:02

Please try a clean build folder, and also check for any stale ggml-vulkan-shaders.* files in the source tree.

Here is my RPM spec file for the build; it uses a clean folder every time.

%global __python %{__python3}
%define git_tag b4778

Name:           llama-cpp
Version:        1.0.0
Release:        %{git_tag}
Summary:        local LLM inference engine

License:        MIT


%description
llama-cpp is an LLM inference engine.

# Set the source directory
%define source_dir /builddir/llama/

%prep
cd %{source_dir}
git checkout %{git_tag}

rm -rf build; mkdir build

cd build
export PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig:/usr/local/share/pkgconfig
export CFLAGS=' -O3  -fno-omit-frame-pointer'
export CXXFLAGS=$CFLAGS

CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
            -DGGML_BLAS:BOOL=ON \
             -DGGML_BLAS_VENDOR=OpenBLAS \
             -DGGML_CCACHE=OFF  \
             -DGGML_VULKAN:BOOL=ON \
             -DGGML_VULKAN_RUN_TESTS=ON \
             -DBUILD_SHARED_LIBS:BOOL=OFF \
             -DCMAKE_BUILD_TYPE=Release ..

%build

cd %{source_dir}/build


nice make -j 64 -l 9

%install
rm -rf %{buildroot}
mkdir -p %{buildroot}

cd %{source_dir}/build

make DESTDIR=%{buildroot} install


%files
/*

acbits • Feb 25 '25 23:02

Did you delete any stale shader files in the source tree? I was wondering if it was an issue like https://github.com/ggml-org/llama.cpp/issues/11788#issuecomment-2661877838, but somehow manifesting as a runtime shader compile failure rather than a build-time failure.

If that's not it, can you try to bisect to a commit?
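A bisect between the last known-good build and the current one could look roughly like this (a sketch; b4743 and b4778 are the builds mentioned in this thread):

git bisect start
git bisect bad b4778      # the build that crashes
git bisect good b4743     # the last build that worked
# rebuild and run llama-server at each step, then tell git the result:
git bisect good           # or: git bisect bad
git bisect reset          # when finished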

jeffbolznv • Feb 26 '25 00:02

I added a git clean -fdx to my build script. This is how the tree looks before the build.

find . -iname '*shaders*'
./ggml/src/ggml-kompute/kompute-shaders
./ggml/src/ggml-vulkan/vulkan-shaders
./ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp

After the build

find . -iname '*shaders*'
./ggml/src/ggml-kompute/kompute-shaders
./ggml/src/ggml-vulkan/vulkan-shaders
./ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
./build/bin/vulkan-shaders-gen
./build/ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/ggml-vulkan-shaders.cpp.o
./build/ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/ggml-vulkan-shaders.cpp.o.d
./build/ggml/src/ggml-vulkan/ggml-vulkan-shaders.hpp
./build/ggml/src/ggml-vulkan/vulkan-shaders.spv
./build/ggml/src/ggml-vulkan/ggml-vulkan-shaders.cpp
./build/ggml/src/ggml-vulkan/vulkan-shaders
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o.d
./build/ggml/src/ggml-vulkan/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o

Even after rebuilding, the failure is the same. I will bisect when I get the time, or create a debug build and find the exact line where it fails.
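A debug build along these lines (a sketch adapted from the cmake call in my spec file, using -DCMAKE_BUILD_TYPE=RelWithDebInfo so symbols are kept) should let gdb show the failing line:

CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
             -DGGML_VULKAN:BOOL=ON \
             -DBUILD_SHARED_LIBS:BOOL=OFF \
             -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make -j 8
gdb --args ./bin/llama-server -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf --gpu-layers 128 --port 8999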

acbits • Feb 26 '25 01:02

The out-of-range error happens much later, after a failed shader compile. There should have been a message to stderr about which shader failed to compile. I think bisecting will be helpful, but there's a good chance this is a driver bug of some sort.

jeffbolznv • Feb 26 '25 01:02

There are no other messages.

load_tensors:   CPU_Mapped model buffer size =   236.47 MiB
load_tensors:      Vulkan0 model buffer size =  1564.62 MiB
............................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =   112.00 MiB
llama_init_from_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_init_from_model: Vulkan_Host  output buffer size =     0.58 MiB
llama_init_from_model:    Vulkan0 compute buffer size =   299.75 MiB
llama_init_from_model: Vulkan_Host compute buffer size =    11.01 MiB
llama_init_from_model: graph nodes  = 986
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at

BTW, I am using the Vulkan drivers from Mesa (RADV).

Is there any way I can dump the shader compilation logs?

acbits • Feb 26 '25 01:02

It's the broken tests: you built with -DGGML_VULKAN_RUN_TESTS=ON, which is currently broken. Regardless of that, it should never be enabled on a build you want to use; it just runs a number of internal unit tests for development purposes.

This is not the first case of someone enabling this; I guess the confusion is that it sounds like a post-build validation test run?
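For a build you actually want to run, just drop that option. For example, the configure call from the spec file above, minus the test flag, would be roughly:

CXX=clang++ CC=clang cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
             -DGGML_BLAS:BOOL=ON \
             -DGGML_BLAS_VENDOR=OpenBLAS \
             -DGGML_CCACHE=OFF \
             -DGGML_VULKAN:BOOL=ON \
             -DBUILD_SHARED_LIBS:BOOL=OFF \
             -DCMAKE_BUILD_TYPE=Release ..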

0cc4m • Feb 26 '25 08:02

It's the broken tests: you built with -DGGML_VULKAN_RUN_TESTS=ON, which is currently broken. Regardless of that, it should never be enabled on a build you want to use; it just runs a number of internal unit tests for development purposes.

This is not the first case of someone enabling this; I guess the confusion is that it sounds like a post-build validation test run?

Yes, that was it. I had enabled it thinking it was a post-build validation test; I turned it on recently because of a few Vulkan-related crashes at runtime.

I think we should disable this option until it is stable.

acbits • Feb 26 '25 19:02

I already fixed it in a feature I'm working on, but that's not yet ready to be merged. Even if it had not crashed, though, your program would have just run the tests instead of doing whatever you wanted it to do.

0cc4m • Feb 26 '25 20:02