ggml_vulkan: Error: Missing op: ARGSORT
Works with other models, both bigger and smaller, and also works with a smaller Mixtral model. It fails on nous-hermes-2-mixtral-8x7b-dpo.Q8_0.gguf.
The command was:
/bin/main -m /media/asus/A.I.2tb/llm_models/theBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/nous-hermes-2-mixtral-8x7b-dpo.Q8_0.gguf -p "Hi you how are you" -n 50 -e -ngl 33 -t 4
Log start
main: build = 2168 (d250c9d6)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1708199035
ggml_vulkan: Found 4 Vulkan devices:
Vulkan0: Tesla P40 | uma: 0 | fp16: 0 | warp size: 32
Vulkan1: Tesla P40 | uma: 0 | fp16: 0 | warp size: 32
Vulkan2: Tesla P40 | uma: 0 | fp16: 0 | warp size: 32
Vulkan3: Tesla P40 | uma: 0 | fp16: 0 | warp size: 32
llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /media/asus/A.I.2tb/llm_models/theBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF/nous-hermes-2-mixtral-8x7b-dpo.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = nousresearch_nous-hermes-2-mixtral-8x...
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.expert_count u32 = 8
llama_model_loader: - kv 11: llama.expert_used_count u32 = 2
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: general.file_type u32 = 7
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 32 tensors
llama_model_loader: - type q8_0: 898 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 8
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 46.22 GiB (8.50 BPW)
llm_load_print_meta: general.name = nousresearch_nous-hermes-2-mixtral-8x7b-dpo
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 1.90 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 132.82 MiB
llm_load_tensors: Vulkan0 buffer size = 13235.34 MiB
llm_load_tensors: Vulkan1 buffer size = 11764.75 MiB
llm_load_tensors: Vulkan2 buffer size = 11764.75 MiB
llm_load_tensors: Vulkan3 buffer size = 10426.99 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 1152.00 MiB
llama_kv_cache_init: Vulkan1 KV buffer size = 1024.00 MiB
llama_kv_cache_init: Vulkan2 KV buffer size = 1024.00 MiB
llama_kv_cache_init: Vulkan3 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size = 73.13 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 2164.03 MiB
llama_new_context_with_model: Vulkan1 compute buffer size = 2172.01 MiB
llama_new_context_with_model: Vulkan2 compute buffer size = 2172.01 MiB
llama_new_context_with_model: Vulkan3 compute buffer size = 2172.01 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 9
ggml_vulkan: Error: Missing op: ARGSORT
GGML_ASSERT: /home/asus/LLMs/llama.cpp/ggml-vulkan.cpp:4256: false
Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
Ubuntu 22.04. Latest Vulkan and llama.cpp.
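For anyone reading the trace: the failure point is the backend refusing a graph node whose op it has no Vulkan kernel for. Below is a hypothetical sketch of that kind of op dispatch (made-up names, not the actual ggml-vulkan.cpp code). The "Could not attach" / ptrace lines in the log are just the assertion handler trying to print a backtrace with gdb and being blocked by Yama ptrace restrictions; they are a side effect, not part of the bug.

```cpp
// Hypothetical sketch (made-up names, NOT the actual ggml-vulkan.cpp code) of how a
// backend that dispatches graph nodes by op type ends up aborting on an op it has
// no kernel for -- the pattern behind "Missing op: ARGSORT" + GGML_ASSERT(false).
#include <cstdio>
#include <cstdlib>

enum class Op { ADD, MUL_MAT, SOFT_MAX, ARGSORT };

static const char * op_name(Op op) {
    switch (op) {
        case Op::ADD:      return "ADD";
        case Op::MUL_MAT:  return "MUL_MAT";
        case Op::SOFT_MAX: return "SOFT_MAX";
        case Op::ARGSORT:  return "ARGSORT";
    }
    return "UNKNOWN";
}

// Pretend this backend only has kernels for ADD, MUL_MAT and SOFT_MAX.
static void dispatch_op(Op op) {
    switch (op) {
        case Op::ADD:      /* submit ADD shader */      return;
        case Op::MUL_MAT:  /* submit MUL_MAT shader */  return;
        case Op::SOFT_MAX: /* submit SOFT_MAX shader */ return;
        default:
            // No kernel for this op on this backend: report it and abort,
            // which is what the GGML_ASSERT in the log corresponds to.
            std::fprintf(stderr, "backend: Error: Missing op: %s\n", op_name(op));
            std::abort();
    }
}

int main() {
    dispatch_op(Op::MUL_MAT); // fine, a kernel exists
    dispatch_op(Op::ARGSORT); // aborts: Mixtral graphs contain this op
}
```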
This is expected. The Vulkan backend doesn't support all features yet, including the Mixtral architecture. I think this is a documentation issue; it should be made clearer which features to expect from each backend.
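The reason a Mixtral graph needs ARGSORT at all (while dense Llama-style models don't hit it) is the mixture-of-experts routing: for every token, the router logits over the 8 experts are sorted so the top n_expert_used = 2 experts can be selected. A minimal plain C++ sketch of that selection step, with made-up logit values and no ggml involved:

```cpp
// Minimal sketch (plain C++, not ggml code) of MoE top-k expert selection:
// sort expert indices by router logit and keep the first n_expert_used.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical router logits for one token over 8 experts (Mixtral: 8 experts, top-2).
    std::vector<float> logits = {0.1f, 1.9f, -0.3f, 0.7f, 2.4f, 0.0f, -1.1f, 0.5f};
    const int n_expert_used = 2;

    // argsort: expert indices ordered by descending logit.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(), [&](int a, int b) { return logits[a] > logits[b]; });

    // The first n_expert_used indices are the experts this token is routed to.
    for (int i = 0; i < n_expert_used; ++i) {
        std::printf("route to expert %d (logit %.2f)\n", idx[i], logits[idx[i]]);
    }
}
```

So until the Vulkan backend gains an ARGSORT kernel, a graph that needs this routing step will abort the way the log shows, even though dense models run fine.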
Are there plans for the Vulkan backend to support Mixtral in the near future?
I believe so: https://github.com/ggerganov/llama.cpp/pull/5835#issuecomment-1974877433
This issue was closed because it has been inactive for 14 days since being marked as stale.