
Why does every answer end with <|im_end|>?

ChaoII opened this issue 1 year ago · 7 comments

D:\llama.cpp\models>..\build\install\bin\main.exe -m qwen1_5-4b-chat-q4_0.gguf -cml --color -i
Log start
main: build = 2725 (784e11de)
main: built with MSVC 19.35.32215.0 for
main: seed  = 1714123916
llama_model_loader: loaded meta data with 21 key-value pairs and 483 tensors from qwen1_5-4b-chat-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen1.5-4B-Chat-AWQ-fp16
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 40
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 2560
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 6912
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 20
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 20
llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                       qwen2.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  10:                qwen2.use_parallel_residual bool             = true
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  201 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 20
llm_load_print_meta: n_head_kv        = 20
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2560
llm_load_print_meta: n_embd_v_gqa     = 2560
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 6912
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.95 B
llm_load_print_meta: model size       = 2.16 GiB (4.71 BPW)
llm_load_print_meta: general.name     = Qwen1.5-4B-Chat-AWQ-fp16
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.23 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors:        CPU buffer size =  2216.46 MiB
...............................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   200.00 MiB
llama_new_context_with_model: KV self size  =  200.00 MiB, K (f16):  100.00 MiB, V (f16):  100.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   606.03 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_new_context_with_model: graph nodes  = 1406
llama_new_context_with_model: graph splits = 564

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: '<|im_start|>user
'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 4


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
<|im_end|>
<|im_start|>user

> hi
Hello! How can I help you today? If you have any questions or need assistance, feel free to ask.<|im_end|>

> What's your name?
I am Qwen, a large language model created by Alibaba Cloud.<|im_end|>

ChaoII · Apr 26 '24 09:04

EOS token = 151645 '<|im_end|>'

<|im_end|> is the EOS token for -cml models like Qwen.
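For reference, the ChatML format that -cml and the model's tokenizer.chat_template metadata use wraps every turn in <|im_start|> / <|im_end|> markers. Reconstructed from the transcript above, a conversation looks roughly like this:

  <|im_start|>system
  <|im_end|>
  <|im_start|>user
  hi<|im_end|>
  <|im_start|>assistant
  Hello! How can I help you today? If you have any questions or need assistance, feel free to ask.<|im_end|>

So the trailing <|im_end|> is simply the model ending its turn; whether the front end prints it or strips it is a display choice.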

Jeximo · Apr 26 '24 12:04

I did use the Qwen model. What can I do?

ChaoII · Apr 26 '24 12:04

I did use the Qwen model. What can I do?

@ChaoII It worked as intended.

offloaded 0/41 layers to GPU

Don't forget to add the -ngl 99 parameter to enable your GPU for faster speed.
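For example, the command from the original report would become something like this (assuming the same CUDA build shown in the log above):

  ..\build\install\bin\main.exe -m qwen1_5-4b-chat-q4_0.gguf -cml --color -i -ngl 99

-ngl (--n-gpu-layers) 99 simply requests more layers than the model has (41 in the log), so all of them are offloaded to the GPU.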

Jeximo · Apr 26 '24 12:04

@Jeximo Thanks a lot! No wonder conversation generation is so slow.😄~(@^_^@)~

ChaoII · Apr 26 '24 13:04

I did use the Qwen model. What can I do?

You were not facing this before, I guess? Did it start only after an upgrade? If so, then please revert to an older version and use that.

Update: Try using --chat-template llama3 with the server command. It seemed to work for me; I tried it with mythomax and it worked.
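A hypothetical server invocation along those lines (the model filename is only a placeholder):

  ..\build\install\bin\server.exe -m mythomax.gguf --chat-template llama3

--chat-template overrides whatever template, if any, is stored in the GGUF metadata.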

Further Update: It doesn't always work. I am now using textgenui (the latest version), which seems to work like a charm. Same client code, same old model, loaded via llama.cpp.

QueryType · Apr 26 '24 13:04

In recent commits, the "server" program can auto-detect the ChatML (Chat Markup Language) special tokens from the model's metadata and automatically remove them, and in the "main" program you can run with the flags "-cml -cnv" to remove this token.
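Applied to the command from the original report, that would look something like this (on a build recent enough to have -cnv / --conversation):

  ..\build\install\bin\main.exe -m qwen1_5-4b-chat-q4_0.gguf -cml -cnv --color -ngl 99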

Dawn-Xu-helloworld · May 11 '24 06:05

Does that resolve the issue, @ChaoII?

arnfaldur · May 15 '24 00:05

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Jun 29 '24 01:06