HIP SDK with AMD iGPU rocBLAS error
I am using a gfx1103 and trying to run llama.cpp on Windows.
Steps done:
- installed HIP SDK
- installed perl & ninja
- created environment variable
set HSA_OVERRIDE_GFX_VERSION=11.0.0 (since gfx1103 is normally not supported by the HIP SDK)
- built llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
set PATH=%HIP_PATH%\bin;%PATH%
mkdir build
cd build
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
cmake --build .
Note: I used -DAMDGPU_TARGETS=gfx1100 (matching HSA_OVERRIDE_GFX_VERSION=11.0.0) since gfx1103 is not supported by the HIP SDK.
- tried running main with this command:
main.exe -m <model>.gguf -p <prompt> -n 400 -ngl 99 -e
However, I get the following error:
Log start
main: build = 2612 (1b496a74)
main: built with for x86_64-pc-windows-msvc
main: seed = 1712338084
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../../../mymodels/llama-13b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = meta-llama-13b
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name = meta-llama-13b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
rocBLAS error: Cannot read C:\Program Files\AMD\ROCm\5.7\bin\/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1103
rocBLAS error: Could not initialize Tensile host:
regex_error(error_backref): The expression contained an invalid back reference.
In other words, even though I specify -DAMDGPU_TARGETS=gfx1100 and HSA_OVERRIDE_GFX_VERSION=11.0.0, the program still tries to use gfx1103, which of course is not possible.
Could you please help me with that?
Thanks in advance!
Windows doesn't support HSA_OVERRIDE_GFX_VERSION and probably doesn't have its own equivalent. You would need to compile a Tensile library for gfx1103 for rocBLAS 5.7, or use Linux.
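For reference, a rough sketch of what recompiling rocBLAS with Tensile kernels for gfx1103 could look like on Linux; the tag name and install.sh flags are assumptions and should be checked against the rocBLAS README for your release:

# rough sketch, Linux only; verify tag and flags against the rocBLAS 5.7 documentation
git clone -b rocm-5.7.1 https://github.com/ROCm/rocBLAS
cd rocBLAS
./install.sh -d -a gfx1103   # -d installs build dependencies (incl. Tensile), -a selects the GPU architecture

After a successful build, the resulting TensileLibrary files for gfx1103 would need to end up in the rocblas/library directory that the runtime searches.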
https://github.com/ROCm/ROCm/discussions/2631#discussioncomment-7745585 mentions that on Linux it can work by just copying TensileLibrary_lazy_gfx1102.dat to TensileLibrary_lazy_gfx1103.dat:
sudo cp /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1103.dat
"That's probably not the optimal solution, but it works flawlessly, so far."
But it seems like this doesn't work on Windows?
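A hypothetical Windows equivalent of that workaround (untested; it assumes the HIP SDK actually ships a gfx1102 Tensile archive, and the exact .dat file names under C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library may differ from the Linux ones):

:: list which architectures the bundled rocBLAS ships Tensile archives for (path taken from the error above)
dir "C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library\TensileLibrary*"
:: if a gfx1102 archive exists, copying it under the gfx1103 name mirrors the Linux trick
copy "C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library\TensileLibrary_lazy_gfx1102.dat" "C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library\TensileLibrary_lazy_gfx1103.dat"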
Thank you very much for your reply! I managed to get it running; however, the output shows that something is going wrong :/
Here is the command I executed:
main.exe -m ../../../mymodels/llama-13b.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -ngl 99 -e
and this is the output:
Log start
main: build = 2637 (400d5d72)
main: built with for x86_64-pc-windows-msvc
main: seed = 1712761673
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../../../mymodels/llama-13b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = meta-llama-13b
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name = meta-llama-13b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon(TM) Graphics, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: ROCm0 buffer size = 13023.85 MiB
llm_load_tensors: CPU buffer size = 166.02 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 85.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 11.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 1
Building a website can be done in 10 simple steps:
Step 1:BU#############################################################################################################ّ...ben”maleseg fuelccsegselitiatippRSKarjervcpet于 IBRSlet serial:#ней seg sa controlledetersourgTo1grbarambambippgelebarbe Mannbaratbugcl Hunter Mannatcivudaviaaviaegujenjencleps Kontletterhaltenkaautéamb graywOperationUPDATE Harr SabplateCPambnylev generalwasplatewarnholdignplate�expertAlertplategeneratedspectgenerated4ex Axagrpert' AlertpertpertpertgenerateMAoviїwasapatangeswasuouthbenpleazugewas...miautéahn Diamiimiwasedenvgeneratedracypecgenerated Ama Hernheswasgeneratedgeneratedbólangevron3ellemaühisiouthisebeni1… Hern4rubenHERHERalus1...g......7...)7VAR Hern Herncuruntangelёpsanzpostvaridl...) AfrbarrsPod blackariegotbarplateillonitzgategeneratedgeneratedgeneratedpagespodades Success MovресavidissesPDkesenc-Sivpecpec bars hochnad deltatsis tagsitesudдьSaeraskifanfanes Urseskena|nach enforesherr|bersever (' scalkk Lubnak Gor3 embarkolPD >>cal migr scal| scalplateGeneral lad|kalkolcur0 lac lac separatelypecDAT lac lac lacCldledenklpecaadiuntkalpeнейeszPD amazonPD Petythjar Ceteskstan
llama_print_timings: load time = 13189.40 ms
llama_print_timings: sample time = 31.87 ms / 400 runs ( 0.08 ms per token, 12551.78 tokens per second)
llama_print_timings: prompt eval time = 1029.69 ms / 19 tokens ( 54.19 ms per token, 18.45 tokens per second)
llama_print_timings: eval time = 86713.29 ms / 399 runs ( 217.33 ms per token, 4.60 tokens per second)
llama_print_timings: total time = 88241.82 ms / 418 tokens
Log end
Have you ever seen that before?
Thank you in advance!
Unlike RDNA2, where everything is more or less gfx1030, the RDNA3 ISAs have significant differences between each other. Note that the linked comment also says '(more than "-ngl 32" resulted in gibberish)'. You could try offloading one less layer than the max and setting --no-kv-offload, or try a 7B llama model with the same settings. One possibility is that the kernels for whatever matrix multiplication shapes mistral/llama 7B needs happen to work on gfx1103 despite being compiled for gfx1102, but other kernels rocBLAS uses, e.g. for the KV cache, don't because of an architectural difference. Or, rather than using Tensile kernels for some shapes, rocBLAS might use the compiled-in ones, which won't be there for gfx1103 unless you recompile rocBLAS.
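For example, with the 13B model from the earlier log (41/41 layers were offloaded, so one less than the max would be -ngl 40), the suggested run would look something like:

main.exe -m ../../../mymodels/llama-13b.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -ngl 40 --no-kv-offload -e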
Thank you very much for this detailed explanation! Indeed, after some testing I have to set --no-kv-offload and bring -ngl down. In fact, I get correct results with -ngl up to 24, but the moment I set it to 25 the result is no longer right. I am still not sure why, though xD
I have yet to test with llama 7B.
I also tested llama-2-7b-chat.Q4_0.gguf, and it only reached 0.11 tok/s on the 780M Radeon Graphics (gfx1103)... Is there any way to get better performance?
C:\code\llama.cpp\build\bin>.\main -m c:\code\llama-2-7b-chat.Q4_0.gguf -p "introduce shanghai" -n 128 --no-kv-offload -ngl 24 -e -t 4
Log start
main: build = 2647 (8228b66d)
main: built with for x86_64-pc-windows-msvc
main: seed = 1712791923
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from c:\code\llama-2-7b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
system_info: n_threads = 4 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 128, n_keep = 1
introduce shanghai
ggml_gallocr_needs_realloc: node inp_embd is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
to the world. nobody was surprised. China's largest city, Shanghai, has always been a place of contrasts. From the towering skyscrapers and bustling streets of the city center to the tranquil waterways and traditional architecture of the old town, there is always something new to discover. Shanghai is a city of contrasts, a place where the ancient and the modern coexist. The city's rich history, culture, and natural beauty make it a unique and fascinating destination for travelers. Shanghai is a city of contrasts, a
llama_print_timings: load time = 16934.81 ms
llama_print_timings: sample time = 19.85 ms / 128 runs ( 0.16 ms per token, 6447.39 tokens per second)
llama_print_timings: prompt eval time = 40527.34 ms / 5 tokens ( 8105.47 ms per token, 0.12 tokens per second)
llama_print_timings: eval time = 1106401.19 ms / 127 runs ( 8711.82 ms per token, 0.11 tokens per second)
llama_print_timings: total time = 1147104.15 ms / 132 tokens
Log end
Is there any way, or any plan, to get better performance? Thanks!
This issue was closed because it has been inactive for 14 days since being marked as stale.