Qwen 1.5 Beta 1.8B outputs incoherent text
The latest llama.cpp outputs incoherent text compared to the Transformers output.
Transformers and vLLM work fine, but the llama.cpp GGUF does not.
+1. Both Qwen1.5-72B-Chat and Qwen-72B-Chat output incoherently. The old llama.cpp from around December 2023 worked normally.
That's great info to know. Can you pinpoint the last version that worked? If we can pinpoint which change caused the incoherence, it might get us closer to solving the problem.
Same problem.
There are some mistakes in the model config files. I used the Qwen1.5 GGUF from Hugging Face, which runs successfully. It may be related to this PR: https://huggingface.co/Qwen/Qwen1.5-72B-Chat/commit/bc11a298a0c6a5cd737064db62c6ad20ec6331be
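If the upstream config fix is indeed the cause, one way to verify is to pull the updated Hugging Face repo and re-convert it, then compare against the old GGUF. A minimal sketch, assuming the convert-hf-to-gguf.py script from the llama.cpp tree and an illustrative local path:

python convert-hf-to-gguf.py D:/Qwen1.5-72B-Chat --outtype f16 --outfile D:/Qwen1.5-72B-Chat/ggml-model-f16.gguf

Quantizing and running the resulting file should show whether the config change alone fixes the incoherence.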
Hmm, I'm not sure that's the only issue. I chat-fine-tuned and tried to quantize since then.
Mostly, but there might be some config that needs adjusting, such as the EOS token in the original model's config.
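One way to check the EOS side is to compare the token IDs baked into the GGUF against the original model's config.json / generation_config.json. A minimal sketch, assuming the gguf-dump.py script shipped under gguf-py/scripts in the llama.cpp tree (the model path is illustrative):

python gguf-py/scripts/gguf-dump.py D:/Qwen1.5-0.5B-Chat/ggml-model-f16.gguf

For reference, the log below shows the converted GGUF carrying tokenizer.ggml.eos_token_id = 151645 ('<|im_end|>') and bos/pad = 151643 ('<|endoftext|>').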
So, is this problem solved?
Not in the official repo.
I have the same problem.
(llama) D:\llama.cpp\build\install\bin>main.exe -m D:/Qwen1.5-0.5B-Chat/ggml-model-f16.gguf -p "What's your name?"
Log start
main: build = 2725 (784e11de)
main: built with MSVC 19.35.32215.0 for
main: seed = 1714032293
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from D:/Qwen1.5-0.5B-Chat/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen1.5-0.5B-Chat
llama_model_loader: - kv 2: qwen2.block_count u32 = 24
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1024
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 2816
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 13: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 14: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 15: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 16: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type f16: 170 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 1024
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 2816
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 0.5B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 619.57 M
llm_load_print_meta: model size = 1.15 GiB (16.00 BPW)
llm_load_print_meta: general.name = Qwen1.5-0.5B-Chat
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors: CPU buffer size = 1181.97 MiB
....................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 48.00 MiB
llama_new_context_with_model: KV self size = 48.00 MiB, K (f16): 24.00 MiB, V (f16): 24.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 595.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 5.01 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 340
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0
What's your name?<|im_end|> [end of text]
llama_print_timings: load time = 361.57 ms
llama_print_timings: sample time = 0.08 ms / 1 runs ( 0.08 ms per token, 12195.12 tokens per second)
llama_print_timings: prompt eval time = 37.07 ms / 5 tokens ( 7.41 ms per token, 134.88 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 38.76 ms / 6 tokens
Log end
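Note that in the run above the prompt is passed raw, without the Qwen ChatML template, and the model stops after emitting a single <|im_end|> token, so it is worth checking whether the incoherence persists with a properly templated prompt. A minimal sketch (the system prompt is illustrative; -e enables escape processing of the \n sequences):

main.exe -m D:/Qwen1.5-0.5B-Chat/ggml-model-f16.gguf -e -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat's your name?<|im_end|>\n<|im_start|>assistant\n"

If this build supports it, the --chatml flag (together with -i for interactive mode) applies the same template automatically.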
This issue was closed because it has been inactive for 14 days since being marked as stale.