llama.cpp
Kompute not offloading any GPU layers
Hello,
I was trying to use Kompute. I managed to compile llama.cpp with Kompute using the following steps:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
# clone the Kompute repo into the kompute folder (see the note after these build steps)
# https://github.com/nomic-ai/kompute
mkdir -p build
cd build
cmake .. -DLLAMA_KOMPUTE=1
cmake --build . --config Release
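For the Kompute-sources step above, here is a minimal sketch (assuming llama.cpp tracks the Nomic fork of Kompute as a git submodule at kompute/; if it does not, clone the fork into that folder manually):
# run from the llama.cpp checkout, before invoking cmake
git submodule update --init --recursive
# or, if the submodule is not configured, clone the fork into the expected path
git clone https://github.com/nomic-ai/kompute.git kompute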
Then, to run the model, I did the following:
./bin/main -m openhermes-2.5-neural-chat-v3-3-slerp.Q8_0.gguf -p "Hi you how are you" -ngl 90
and got the following output:
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 7338.64 MiB
What is Kompute for?
What is Kompute for?
It theoretically optimizes performance for some GPUs.
It seems to work fine on my setup (Intel CPU + integrated GPU): it reports offloading all layers to the GPU, but the performance difference is marginal.
@userbox020 it looks like Q8_0 quantization is not supported:
https://github.com/ggerganov/llama.cpp/blob/8f1be0d42f23016cb6819dbae01126699c4bd9bc/llama.cpp#L4488-L4502
You might notice this line with openhermes-2.5-neural-chat-v3-3-slerp.Q8_0.gguf:
llama_model_load: disabling Kompute due to unsupported model arch or quantization
I tested openhermes-2.5-neural-chat-v3-3-slerp.Q4_0.gguf with an NVIDIA GeForce RTX 3050 and Kompute / Vulkan: it managed to offload 16/33 layers to the GPU. But performance is not there for this model; the CPU is faster.
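If you only have the Q8_0 file locally, one option is to requantize it down to Q4_0 with the bundled quantize tool (a rough sketch, assuming quantize was built alongside main; quality will be somewhat worse than quantizing from the original f16):
# --allow-requantize permits converting a file that is already quantized
./bin/quantize --allow-requantize openhermes-2.5-neural-chat-v3-3-slerp.Q8_0.gguf openhermes-2.5-neural-chat-v3-3-slerp.Q4_0.gguf Q4_0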
@phymbert thanks bro, going to check with a lower quant. Do you know which is the maximum one it supports? In your code snippet it looks like only Q4, F32 and F16, but I'm not sure exactly which quants those are.
Same issue here. Vulkan works ok-ish on my AMD Vega VII with about 20% GPU usage. Kompute does not work with the models I tested, like https://huggingface.co/TheBloke/WizardLM-1.0-Uncensored-Llama2-13B-GGUF
PS C:\Code\ML\llamacpp\kompute> .\server.exe -m ..\models\7b\wizardlm-1.0-uncensored-llama2-13b.Q4_K_M.gguf
{"timestamp":1708595348,"level":"INFO","function":"main","line":2574,"message":"build info","build":2234,"commit":"973053d8"}
{"timestamp":1708595348,"level":"INFO","function":"main","line":2581,"message":"system info","n_threads":16,"n_threads_batch":-1,"total_threads":32,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | "}
llama server listening at http://127.0.0.1:8080
{"timestamp":1708595348,"level":"INFO","function":"main","line":2731,"message":"HTTP server listening","hostname":"127.0.0.1","port":"8080"}
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ..\models\7b\wizardlm-1.0-uncensored-llama2-13b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_K: 241 tensors
llama_model_loader: - type q6_K: 41 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 7.33 GiB (4.83 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors: CPU buffer size = 7500.85 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: CPU input buffer size = 12.01 MiB
llama_new_context_with_model: CPU compute buffer size = 80.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Available slots:
-> Slot 0 - max context: 512
{"timestamp":1708595350,"level":"INFO","function":"main","line":2752,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
Same issue here. Vulkan works ok-ish on my AMD Vega VII with about 20% GPU usage. Kompute does not work with the models I tested, like https://huggingface.co/TheBloke/WizardLM-1.0-Uncensored-Llama2-13B-GGUF
Sup bro, they don't work because at the moment Kompute only supports:
model.ftype == LLAMA_FTYPE_ALL_F32 ||
model.ftype == LLAMA_FTYPE_MOSTLY_F16 ||
model.ftype == LLAMA_FTYPE_MOSTLY_Q4_0 ||
model.ftype == LLAMA_FTYPE_MOSTLY_Q4_1
and you are trying a Q4_K_M model.
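A quick way to confirm what file type a GGUF reports and whether Kompute rejected it is to grep the loader output (the paths and flags here are just an example; as the logs above show, general.file_type = 15 corresponds to Q4_K - Medium):
# look for the reported file type and the Kompute warning while the model loads
./bin/main -m wizardlm-1.0-uncensored-llama2-13b.Q4_K_M.gguf -p "hi" -n 1 -ngl 99 2>&1 | grep -E "general.file_type|disabling Kompute"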
Same issue here. Vulkan works ok-ish on my AMD Vega VII with about 20% GPU usage.
Out of interest, how much performance gain do you get for that 20% GPU usage versus CPU only?
That doesn't make sense; Kompute is an optimized Vulkan backend, so something must be programmed or configured wrong. I managed to build llama.cpp with Kompute, but because it's still in development and only supports 4 quantization types I didn't do further tests. At the moment I'm working with the llama.cpp Vulkan backend.
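For reference, the Vulkan backend mentioned here is selected with its own CMake flag rather than LLAMA_KOMPUTE (a sketch, assuming the option in builds of that period was LLAMA_VULKAN):
mkdir -p build-vulkan
cd build-vulkan
cmake .. -DLLAMA_VULKAN=1
cmake --build . --config Release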
This issue was closed because it has been inactive for 14 days since being marked as stale.