Eval bug: Several models producing gibberish
Name and Version
[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
register_backend: registered backend ROCm (2 devices)
register_device: registered device ROCm0 (AMD Radeon VII)
register_device: registered device ROCm1 (AMD Radeon VII)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Celeron(R) CPU G3930 @ 2.90GHz)
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-hip.so
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-cpu.so
version: 4753 (51f311e0)
built with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-23) for x86_64-redhat-linux
Operating systems
Mac, Linux
GGML backends
HIP
Hardware
CPU = Intel Celeron G3930
GPU = 2x AMD Instinct MI50 (gfx906; reported as AMD Radeon VII by ROCm)
Models
https://huggingface.co/microsoft/phi-4-gguf/blob/main/phi-4-q4.gguf
https://huggingface.co/YorkieOH10/Meta-Llama-3.1-8B-Instruct-Q8_0-GGUF/resolve/main/meta-llama-3.1-8b-instruct-q8_0.gguf?download=true
Problem description & steps to reproduce
The model produces random character strings (gibberish) whenever layers are offloaded to the GPU.
~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 999
Installed ROCm on AlmaLinux 8.10 following the package-manager instructions here: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-methods/package-manager/package-manager-rhel.html
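A quick sanity check that the ROCm runtime sees both gfx906 cards (standard ROCm tools; paths assume the default /opt/rocm install, so treat this as a sketch rather than part of the original report):

  /opt/rocm/bin/rocminfo | grep -i gfx   # should list gfx906 agents for both MI50s
  /opt/rocm/bin/rocm-smi                 # should show both GPUs and their VRAM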
Built llama.cpp following the HIP build instructions here: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip
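Roughly what that boils down to, per the linked doc (the docs example targets gfx1030; substituting gfx906 for these cards is my assumption, so this is a sketch, not the exact command line that was run):

  HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
      cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release \
      && cmake --build build --config Release -- -j 16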
Output is fine when nothing is offloaded to the GPU and inference runs on the CPU only. Slow, of course, but it works.
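For reference, the same prompt with all layers kept on the CPU (e.g. -ngl 0, or simply omitting -ngl) produces coherent output:

  ~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 0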
First Bad Commit
No response
Relevant log output
[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
register_backend: registered backend ROCm (2 devices)
register_device: registered device ROCm0 (AMD Radeon VII)
register_device: registered device ROCm1 (AMD Radeon VII)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Celeron(R) CPU G3930 @ 2.90GHz)
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-hip.so
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-cpu.so
build: 4753 (51f311e0) with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-23) for x86_64-redhat-linux (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon VII) - 16348 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon VII) - 16348 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 243 tensors from /root/phi-4-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Phi 4
llama_model_loader: - kv 3: general.version str = 4
llama_model_loader: - kv 4: general.organization str = Microsoft
llama_model_loader: - kv 5: general.basename str = phi
llama_model_loader: - kv 6: general.size_label str = 15B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/microsoft/phi-...
llama_model_loader: - kv 9: general.tags arr[str,7] = ["phi", "nlp", "math", "code", "chat"...
llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 11: phi3.context_length u32 = 16384
llama_model_loader: - kv 12: phi3.rope.scaling.original_context_length u32 = 16384
llama_model_loader: - kv 13: phi3.embedding_length u32 = 5120
llama_model_loader: - kv 14: phi3.feed_forward_length u32 = 17920
llama_model_loader: - kv 15: phi3.block_count u32 = 40
llama_model_loader: - kv 16: phi3.attention.head_count u32 = 40
llama_model_loader: - kv 17: phi3.attention.head_count_kv u32 = 10
llama_model_loader: - kv 18: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: phi3.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: phi3.rope.freq_base f32 = 250000.000000
llama_model_loader: - kv 21: phi3.attention.sliding_window u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 100257
llama_model_loader: - kv 30: tokenizer.chat_template str = {% for message in messages %}{% if (m...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 15
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_K: 101 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 8.43 GiB (4.94 BPW)
load: special tokens cache size = 96
load: token to piece cache size = 0.6151 MB
print_info: arch = phi3
print_info: vocab_only = 0
print_info: n_ctx_train = 16384
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 40
print_info: n_head_kv = 10
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1280
print_info: n_embd_v_gqa = 1280
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 17920
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 250000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 16384
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 14B
print_info: model params = 14.66 B
print_info: general.name = Phi 4
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: PAD token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: EOG token = 100265 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: CPU_Mapped model buffer size = 275.62 MiB
load_tensors: ROCm0 model buffer size = 4163.91 MiB
load_tensors: ROCm1 model buffer size = 4190.80 MiB
.......................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 250000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init: ROCm0 KV buffer size = 420.00 MiB
llama_kv_cache_init: ROCm1 KV buffer size = 380.00 MiB
llama_init_from_model: KV self size = 800.00 MiB, K (f16): 400.00 MiB, V (f16): 400.00 MiB
llama_init_from_model: ROCm_Host output buffer size = 0.38 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model: ROCm0 compute buffer size = 437.01 MiB
llama_init_from_model: ROCm1 compute buffer size = 437.02 MiB
llama_init_from_model: ROCm_Host compute buffer size = 42.02 MiB
llama_init_from_model: graph nodes = 1606
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>user<|im_sep|>Hello<|im_end|><|im_start|>assistant<|im_sep|>Hi there<|im_end|><|im_start|>user<|im_sep|>How are you?<|im_end|><|im_start|>assistant<|im_sep|>
system_info: n_threads = 2 (n_threads_batch = 2) / 2 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 1123923216
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
systemHello!
>
&3%9C(6B>;#$C/F;;8/49=0%41%588-6.D5>BB8;)/H@=!$9+,GC51(>40=&89&$'G>2GFF0C*69F-8/<$A88>;@+CB6-#C1B!*=<"5-:4.<'*&E7/A>(G%!-:G*72D+/G+B*://;;3"A'9,*E<FDHG4-524"E:$5F1:7;AA4(45/%%%2;81;8./#5'C'2E$@>@8(%;2<<F
> hello!
%':8B2+F@H!/;,7*;F$"'@!&/&<E6;06:@H(8);-;50>337";*
>
llama_perf_sampler_print: sampling time = 37.41 ms / 60 runs ( 0.62 ms per token, 1603.93 tokens per second)
llama_perf_context_print: load time = 13567.06 ms
llama_perf_context_print: prompt eval time = 4097.00 ms / 17 tokens ( 241.00 ms per token, 4.15 tokens per second)
llama_perf_context_print: eval time = 6391.57 ms / 251 runs ( 25.46 ms per token, 39.27 tokens per second)
llama_perf_context_print: total time = 104277.98 ms / 268 tokens
Interrupted by user
[root@localhost ~]# ^C