llama.cpp
Vulkan backend fails to load models: vk::Device::createComputePipeline: ErrorUnknown
I am trying to cross-compile llama.cpp on an x86 platform and run it on an Android device (Adreno 740). On the Android device, Vulkan recognizes my GPU, but there is an error when loading the model:
llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/data/local/tmp/stories260K.gguf'
main: error: unable to load model
I have checked the model path to make sure the model exists there and is readable. I also tried the following models:
llama-2-13b-chat.Q2_K.gguf
llama-2-13b-chat.Q5_K_S.gguf
llama-2-7b-chat.Q2_K.gguf
stories260K.gguf
How I build llama.cpp:
cmake .. -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_NATIVE_API_LEVEL=33 -DLLAMA_VULKAN=1 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod+i8mm -DVulkan_INCLUDE_DIR=/home/smc/Downloads/Vulkan-Hpp-1.3.237 -DLLAMA_VULKAN_CHECK_RESULTS=1 -DLLAMA_VULKAN_DEBUG=1 -DLLAMA_VULKAN_VALIDATE=1 -DLLAMA_VULKAN_RUN_TESTS=1
make -j10
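As a sanity check for this setup, a minimal standalone program (my own sketch, not part of llama.cpp; built with the same NDK toolchain, linked with -lvulkan, and pushed to the device the same way as main) can compare the Vulkan version reported by the device's loader against the header version used at build time, since I build against Vulkan-Hpp 1.3.237 on the host:

#include <vulkan/vulkan.h>
#include <cstdio>

int main() {
    // vkEnumerateInstanceVersion is available on Vulkan 1.1+ loaders,
    // which recent Android versions ship.
    uint32_t apiVersion = VK_API_VERSION_1_0;
    if (vkEnumerateInstanceVersion(&apiVersion) != VK_SUCCESS) {
        std::printf("failed to query instance version\n");
        return 1;
    }
    // Compare what the device's loader reports against the header the
    // binary was compiled with.
    std::printf("loader reports Vulkan %u.%u.%u, built against header %u.%u.%u\n",
                VK_API_VERSION_MAJOR(apiVersion),
                VK_API_VERSION_MINOR(apiVersion),
                VK_API_VERSION_PATCH(apiVersion),
                VK_API_VERSION_MAJOR(VK_HEADER_VERSION_COMPLETE),
                VK_API_VERSION_MINOR(VK_HEADER_VERSION_COMPLETE),
                VK_API_VERSION_PATCH(VK_HEADER_VERSION_COMPLETE));
    return 0;
}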
How I run main: transfer the bin folder to the /data/local/tmp/llama directory on the Android device using scp, then run:
./bin/main -t 8 -m /data/local/tmp/stories260K.gguf --color -c 2048 -ngl 2 --temp 0.7 -n 128 -p "One day, Lily met"
uname -a
Linux localhost 5.15.78-android13-8-g60893c660740-dirty #1 SMP PREEMPT Fri Jul 7 18:13:57 UTC 2023 aarch64 Toybox
GPU info: Adreno (TM) 740
What can I do to solve this problem? Any suggestions? Thank you very much; I look forward to your reply.
Detailed information:
:/data/local/tmp/llama-vulkan-test # ls
bin
:/data/local/tmp/llama-vulkan-test # chmod +x ./*
tmp/stories260K.gguf --color -c 2048 -ngl 2 --temp 0.7 -n 128 -p "One day, Lily met" <
Log start
main: build = 3 (de46a4b)
main: built with Android (11349228, +pgo, +bolt, +lto, -mlgo, based on r487747e) clang version 17.0.2 (https://android.googlesource.com/toolchain/llvm-project d9f89f4d16663d5012e5c09495f3b30ece3d2362) for x86_64-unknown-linux-gnu
main: seed = 1713873819
llama_model_loader: loaded meta data with 19 key-value pairs and 48 tensors from /data/local/tmp/stories260K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: tokenizer.ggml.tokens arr[str,512] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 1: tokenizer.ggml.scores arr[f32,512] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 2: tokenizer.ggml.token_type arr[i32,512] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 3: tokenizer.ggml.model str = llama
llama_model_loader: - kv 4: general.architecture str = llama
llama_model_loader: - kv 5: general.name str = llama
llama_model_loader: - kv 6: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 7: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 8: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 9: tokenizer.ggml.seperator_token_id u32 = 4294967295
llama_model_loader: - kv 10: tokenizer.ggml.padding_token_id u32 = 4294967295
llama_model_loader: - kv 11: llama.context_length u32 = 128
llama_model_loader: - kv 12: llama.embedding_length u32 = 64
llama_model_loader: - kv 13: llama.feed_forward_length u32 = 172
llama_model_loader: - kv 14: llama.attention.head_count u32 = 8
llama_model_loader: - kv 15: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 16: llama.block_count u32 = 5
llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 8
llama_model_loader: - kv 18: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - type f32: 48 tensors
llm_load_vocab: bad special token: 'tokenizer.ggml.seperator_token_id' = 4294967295d, using default id -1
llm_load_vocab: bad special token: 'tokenizer.ggml.padding_token_id' = 4294967295d, using default id -1
llm_load_vocab: special tokens definition check successful ( 259/512 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 512
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 128
llm_load_print_meta: n_embd = 64
llm_load_print_meta: n_head = 8
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 5
llm_load_print_meta: n_rot = 8
llm_load_print_meta: n_embd_head_k = 8
llm_load_print_meta: n_embd_head_v = 8
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 32
llm_load_print_meta: n_embd_v_gqa = 32
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 172
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 128
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = all F32 (guessed)
llm_load_print_meta: model params = 292.80 K
llm_load_print_meta: model size = 1.12 MiB (32.00 BPW)
llm_load_print_meta: general.name = llama
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
Did you try to build the llama.android example app?
@smilingOrange Not really. I am not using Termux or Android Studio; I cross-compile llama.cpp with the NDK and transfer the resulting binaries to my Android device (Qualcomm GPU) via scp, a workflow I have already verified: when I cross-compile with the BLAS backend instead of Vulkan, I can just barely get Q2-quantized large models running. But since I am new to Vulkan, I do not understand why, when I cross-compile with the Vulkan backend, Vulkan can recognize the GPU in my device yet cannot load the model. I am curious how to solve this problem; feel free to let me know if you have any ideas.
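In case it helps to separate the two symptoms: the "Vulkan recognizes the GPU" part can be tested on its own with a small enumeration program (again a sketch of mine, independent of llama.cpp; compiled with the same toolchain and linked against -lvulkan):

#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    // Create a bare instance; no extensions are needed just to enumerate.
    VkApplicationInfo app{};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ci{};
    ci.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ci.pApplicationInfo = &app;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) {
        std::printf("vkCreateInstance failed\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        // On this phone the expected output is the Adreno 740 plus the
        // Vulkan API version its driver reports.
        std::printf("device: %s (Vulkan %u.%u.%u)\n", props.deviceName,
                    VK_API_VERSION_MAJOR(props.apiVersion),
                    VK_API_VERSION_MINOR(props.apiVersion),
                    VK_API_VERSION_PATCH(props.apiVersion));
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}

If this prints the Adreno 740 but main still fails, the problem is specific to pipeline creation rather than device discovery.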
In my debugging, I found that the compute shaders for Q4_K and Q5_K are unsupported on the Qualcomm Adreno driver. Without these, it works.
For more info, the failed shaders are: matmul_q4_k_f32_l, matmul_q4_k_f32_m, matmul_q4_k_f32_s, matmul_q4_k_f32_aligned_l, matmul_q4_k_f32_aligned_m, matmul_q4_k_f32_aligned_s, matmul_q5_k_f32_l, matmul_q5_k_f32_m, matmul_q5_k_f32_s, matmul_q5_k_f32_aligned_l, matmul_q5_k_f32_aligned_m, matmul_q5_k_f32_aligned_s, dequant_q4_K, dequant_q5_K.
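Roughly, the probing looked like the sketch below (simplified: load_spirv reads a .spv file from disk here, whereas ggml-vulkan actually embeds its shader blobs at build time, and the pipeline layout must be built to match each shader's descriptor bindings). With exceptions enabled, Vulkan-Hpp converts the failing result into a vk::SystemError, which is exactly where the "vk::Device::createComputePipeline: ErrorUnknown" message above comes from:

#include <vulkan/vulkan.hpp>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Read a SPIR-V binary from disk (placeholder for ggml-vulkan's embedded
// shader blobs).
std::vector<uint32_t> load_spirv(const std::string& path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    size_t size = static_cast<size_t>(f.tellg());
    std::vector<uint32_t> code(size / sizeof(uint32_t));
    f.seekg(0);
    f.read(reinterpret_cast<char*>(code.data()), static_cast<std::streamsize>(size));
    return code;
}

// Try to build a compute pipeline for one shader and report the result.
// `layout` must match the shader's descriptor bindings, as in ggml-vulkan's
// own pipeline setup.
void probe_shader(vk::Device device, vk::PipelineLayout layout,
                  const std::string& name, const std::string& spv_path) {
    std::vector<uint32_t> code = load_spirv(spv_path);
    vk::ShaderModuleCreateInfo smci({}, code.size() * sizeof(uint32_t), code.data());
    vk::ShaderModule module = device.createShaderModule(smci);

    vk::PipelineShaderStageCreateInfo stage({}, vk::ShaderStageFlagBits::eCompute,
                                            module, "main");
    vk::ComputePipelineCreateInfo cpci({}, stage, layout);

    try {
        // This is the call that fails on Adreno for the Q4_K/Q5_K shaders;
        // Vulkan-Hpp turns the error result into a vk::SystemError.
        auto result = device.createComputePipeline(nullptr, cpci);
        device.destroyPipeline(result.value);
        std::printf("%s: OK\n", name.c_str());
    } catch (const vk::SystemError& err) {
        std::printf("%s: FAILED (%s)\n", name.c_str(), err.what());
    }

    device.destroyShaderModule(module);
}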
Using llama.cpp's Vulkan backend with Adreno GPUs will be buggy; see https://github.com/ggerganov/llama.cpp/issues/5186#issuecomment-1960126390
The issue still exists with the latest master (ecf6b7). The program fails to load the model due to a failure while creating the Vulkan pipeline for matmul_q4_k_f32_l.
+1. However, I tried a Mali G68 GPU and it does work, but the speed is far too slow, even slower than pure CPU.