llama.cpp

Eval bug: Several models producing gibberish

Open · iamangus opened this issue 1 day ago • 11 comments

Name and Version

[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
register_backend: registered backend ROCm (2 devices)
register_device: registered device ROCm0 (AMD Radeon VII)
register_device: registered device ROCm1 (AMD Radeon VII)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Celeron(R) CPU G3930 @ 2.90GHz)
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-hip.so
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-cpu.so
version: 4753 (51f311e0)
built with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-23) for x86_64-redhat-linux

Operating systems

Mac, Linux

GGML backends

HIP

Hardware

CPU = Intel Celeron G3930, GPU = 2x AMD Instinct MI50

Models

https://huggingface.co/microsoft/phi-4-gguf/blob/main/phi-4-q4.gguf
https://huggingface.co/YorkieOH10/Meta-Llama-3.1-8B-Instruct-Q8_0-GGUF/resolve/main/meta-llama-3.1-8b-instruct-q8_0.gguf?download=true

Problem description & steps to reproduce

I get random character strings (gibberish) when offloading layers to the GPUs, e.g.:

~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 999

Installed ROCm on AlmaLinux 8.10 following the package-manager steps here: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-methods/package-manager/package-manager-rhel.html
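For reference, the install roughly amounted to the following (package names are an approximation of that guide, not copied from my shell history; see the linked page for the exact repo setup):

# approximate ROCm install on AlmaLinux 8.10 via dnf (repo configuration per the AMD guide)
sudo dnf install rocm-hip-sdk rocminfo     # HIP runtime/SDK plus the rocminfo tool
sudo usermod -aG render,video $USER        # give the user access to the GPU device nodes
rocminfo | grep gfx                        # sanity check: both gfx906 devices should show up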

Built llama.cpp with the HIP backend following the steps here: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip
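For reference, the build from that guide looks roughly like this (gfx906 is the assumed target for the MI50s; double-check the flags against the linked doc):

# HIP build per docs/build.md; AMDGPU_TARGETS=gfx906 assumed for the MI50s
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j $(nproc)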

It works fine when not offloading to the GPU and running on the CPU only. Slow, of course, but the output is coherent.
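For comparison, running the same prompt fully on the CPU (no layers offloaded) gives normal output:

# same model and prompt as above; -ngl 0 keeps all layers on the CPU
~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 0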

First Bad Commit

No response

Relevant log output

[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
register_backend: registered backend ROCm (2 devices)
register_device: registered device ROCm0 (AMD Radeon VII)
register_device: registered device ROCm1 (AMD Radeon VII)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Celeron(R) CPU G3930 @ 2.90GHz)
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-hip.so
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-cpu.so
build: 4753 (51f311e0) with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-23) for x86_64-redhat-linux (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon VII) - 16348 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon VII) - 16348 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 243 tensors from /root/phi-4-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Phi 4
llama_model_loader: - kv   3:                            general.version str              = 4
llama_model_loader: - kv   4:                       general.organization str              = Microsoft
llama_model_loader: - kv   5:                           general.basename str              = phi
llama_model_loader: - kv   6:                         general.size_label str              = 15B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/phi-...
llama_model_loader: - kv   9:                               general.tags arr[str,7]       = ["phi", "nlp", "math", "code", "chat"...
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  11:                        phi3.context_length u32              = 16384
llama_model_loader: - kv  12:  phi3.rope.scaling.original_context_length u32              = 16384
llama_model_loader: - kv  13:                      phi3.embedding_length u32              = 5120
llama_model_loader: - kv  14:                   phi3.feed_forward_length u32              = 17920
llama_model_loader: - kv  15:                           phi3.block_count u32              = 40
llama_model_loader: - kv  16:                  phi3.attention.head_count u32              = 40
llama_model_loader: - kv  17:               phi3.attention.head_count_kv u32              = 10
llama_model_loader: - kv  18:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                  phi3.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                        phi3.rope.freq_base f32              = 250000.000000
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 0
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = dbrx
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 100257
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% for message in messages %}{% if (m...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  101 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.43 GiB (4.94 BPW) 
load: special tokens cache size = 96
load: token to piece cache size = 0.6151 MB
print_info: arch             = phi3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 16384
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 40
print_info: n_head_kv        = 10
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1280
print_info: n_embd_v_gqa     = 1280
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 17920
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 250000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 16384
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.66 B
print_info: general.name     = Phi 4
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100257 '<|endoftext|>'
print_info: EOS token        = 100257 '<|endoftext|>'
print_info: EOT token        = 100257 '<|endoftext|>'
print_info: PAD token        = 100257 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 100258 '<|fim_prefix|>'
print_info: FIM SUF token    = 100260 '<|fim_suffix|>'
print_info: FIM MID token    = 100259 '<|fim_middle|>'
print_info: EOG token        = 100257 '<|endoftext|>'
print_info: EOG token        = 100265 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   275.62 MiB
load_tensors:        ROCm0 model buffer size =  4163.91 MiB
load_tensors:        ROCm1 model buffer size =  4190.80 MiB
.......................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 250000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   420.00 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =   380.00 MiB
llama_init_from_model: KV self size  =  800.00 MiB, K (f16):  400.00 MiB, V (f16):  400.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.38 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model:      ROCm0 compute buffer size =   437.01 MiB
llama_init_from_model:      ROCm1 compute buffer size =   437.02 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    42.02 MiB
llama_init_from_model: graph nodes  = 1606
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>user<|im_sep|>Hello<|im_end|><|im_start|>assistant<|im_sep|>Hi there<|im_end|><|im_start|>user<|im_sep|>How are you?<|im_end|><|im_start|>assistant<|im_sep|>

system_info: n_threads = 2 (n_threads_batch = 2) / 2 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 1123923216
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

systemHello!


> 
&3%9C(6B>;#$C/F;;8/49=0%41%588-6.D5>BB8;)/H@=!$9+,GC51(>40=&89&$'G>2GFF0C*69F-8/<$A88>;@+CB6-#C1B!*=<"5-:4.<'*&E7/A>(G%!-:G*72D+/G+B*://;;3"A'9,*E<FDHG4-524"E:$5F1:7;AA4(45/%%%2;81;8./#5'C'2E$@>@8(%;2<<F
> hello!
%':8B2+F@H!/;,7*;F$"'@!&/&<E6;06:@H(8);-;50>337";*
> 
llama_perf_sampler_print:    sampling time =      37.41 ms /    60 runs   (    0.62 ms per token,  1603.93 tokens per second)
llama_perf_context_print:        load time =   13567.06 ms
llama_perf_context_print: prompt eval time =    4097.00 ms /    17 tokens (  241.00 ms per token,     4.15 tokens per second)
llama_perf_context_print:        eval time =    6391.57 ms /   251 runs   (   25.46 ms per token,    39.27 tokens per second)
llama_perf_context_print:       total time =  104277.98 ms /   268 tokens
Interrupted by user
[root@localhost ~]# ^C

iamangus · Feb 21 '25 20:02