Bug: After converting the InternLM2 7B model from LLamaFactory and importing it into ollama, I get an error: tensor 'token_embd.weight' has wrong shape.
What happened?
I fine-tuned the InternLM2 7B-chat model in LLamaFactory with a custom dataset and LoRA, exported the safetensors model, converted it to GGUF format using the convert_hf_to_gguf.py script, and finally imported it into ollama to run it. ollama reported this error:
Error: llama runner process has terminated: signal: aborted (core dumped) error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected 4096, 92550, got 4096, 92544, 1, 1
llama_load_model_from_file: exception loading model
python convert_hf_to_gguf log
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:Set model tokenizer
WARNING:hf-to-gguf:InternLM2 convert token 'b'\x00'' to '🐉'!
WARNING:hf-to-gguf:Replace eos:2 with a special token:92542 in chat mode so that the conversation can end normally.
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 92542
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting special token type pad to 2
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
...
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {4096, 92544}
...
quantize q4_0 log
...
llama_model_loader: - kv 0: general.architecture str = internlm2
llama_model_loader: - kv 1: general.name str = InternLM2
llama_model_loader: - kv 2: internlm2.context_length u32 = 32768
llama_model_loader: - kv 3: internlm2.block_count u32 = 32
llama_model_loader: - kv 4: internlm2.embedding_length u32 = 4096
llama_model_loader: - kv 5: internlm2.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: internlm2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 7: internlm2.attention.head_count u32 = 32
llama_model_loader: - kv 8: internlm2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: internlm2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.pre str = default
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,92550] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,92550] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,92550] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 92542
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: tokenizer.chat_template str = {{ '<s>' }}{% if messages[0]['role'] ...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
...
Name and Version
llama.cpp source code version: b549a1bbefb2f1fbb8b558bac1f2ae7967e60964
What operating system are you seeing the problem on?
Linux
Relevant log output
### convert_hf_to_gguf
...
INFO:hf-to-gguf:blk.31.attn_output.weight, torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_q.weight, torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_k.weight, torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.attn_v.weight, torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.31.ffn_gate.weight, torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.ffn_down.weight, torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.31.ffn_up.weight, torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {4096, 92544}
...
Does this also happen when using only llama.cpp code?
Yes. I tried to run the example command:
llama-cli -m my_model.gguf -p "I believe the meaning of life is" -n 128
and got this error:
...
llm_load_vocab: special tokens cache size = 457
llm_load_vocab: token to piece cache size = 0.5532 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = internlm2
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 92550
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 7.74 B
llm_load_print_meta: model size = 14.41 GiB (16.00 BPW)
llm_load_print_meta: general.name = InternLM2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 92542 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 92542 '<|im_end|>'
llm_load_print_meta: max token length = 384
llm_load_tensors: ggml ctx size = 0.14 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected 4096, 92550, got 4096, 92544, 1, 1
llama_load_model_from_file: failed to load model
...
When I do the same SFT training with Qwen2-7B, LLamaFactory and llama.cpp both work fine, and the converted GGUF model is usable in ollama.
So I think this is a bug in the InternLM2 model conversion that needs a check and fix. @JohannesGaessler
llm_load_print_meta: n_vocab = 92550
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {4096, 92544}
@Sakura4036 The vocab size does not match the tensor size.
Try to modify the vocab_size field in config.json to make it match, then re-convert the model.
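Something along these lines (a rough sketch, not the exact tooling used here; the path is a placeholder and the tensor name is assumed from InternLM2's HF checkpoints) would show whether the two numbers actually match:

```python
# Sketch: compare config.json's vocab_size with the first dimension of
# the token embedding tensor in the exported safetensors shards.
import glob
import json

from safetensors import safe_open  # pip install safetensors

model_dir = "/path/to/exported-internlm2"  # placeholder

with open(f"{model_dir}/config.json") as f:
    vocab_size = json.load(f)["vocab_size"]

embd_shape = None
for shard in glob.glob(f"{model_dir}/*.safetensors"):
    with safe_open(shard, framework="pt") as f:
        # "model.tok_embeddings.weight" is the embedding tensor name used by
        # InternLM2's HF weights; adjust if your export names it differently.
        if "model.tok_embeddings.weight" in f.keys():
            embd_shape = f.get_slice("model.tok_embeddings.weight").get_shape()

print("config.json vocab_size :", vocab_size)  # 92544 here
print("embedding tensor shape :", embd_shape)  # likely [92544, 4096], per the convert log
```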
I tried to modify the vocab_size field in config.json from 92544 to 92550 and re-converted the model with convert_hf_to_gguf.py, but got an error:
Traceback (most recent call last):
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3583, in <module>
main()
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3567, in main
model_instance.set_vocab()
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 2129, in set_vocab
piece = tokenizer.IdToPiece(token_id)
File "/home/anaconda3/envs/llama/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1179, in _batched_func
return _func(self, arg)
File "/home/anaconda3/envs/llama/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1172, in _func
raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.
I tried to modify the vocab_size field in config.json from 92544 to 92550
I meant to set it to 92544, to match the tensor size, but from what you say it was already that?
n_vocab comes from the number of tokens here:
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,92550] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
https://github.com/ggerganov/llama.cpp/blob/5e116e8dd51775f8f1c090570be148d5d7eea6c3/src/llama.cpp#L4653
So I would have guessed that setting vocab_size to 92544 to match the {4096, 92544}-sized tensor would have helped.
@Sakura4036 Do you happen to have an added_tokens.json file in the same directory as the model? This seems like the only thing other than the vocab_size field which could affect the resulting vocab size.
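If it is there, a quick check like this (a rough sketch; paths are placeholders and it assumes a sentencepiece tokenizer.model in the export) would show whether that file is what pushes the token count past the embedding size:

```python
# Sketch: the converted vocab is the sentencepiece pieces plus whatever
# added_tokens.json appends on top of them.
import json

from sentencepiece import SentencePieceProcessor

model_dir = "/path/to/exported-internlm2"  # placeholder

sp = SentencePieceProcessor(model_file=f"{model_dir}/tokenizer.model")
with open(f"{model_dir}/added_tokens.json") as f:
    added = json.load(f)

print("sentencepiece pieces :", sp.vocab_size())               # should match the tensor size (92544)
print("added_tokens.json    :", len(added))                    # extra ids beyond the tensor
print("total                :", sp.vocab_size() + len(added))  # 92550 would explain n_vocab
```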
Yes, when vocab_size is 92544 (which it was originally), convert_hf_to_gguf.py doesn't report an error, but the GGUF model doesn't work, i.e. I get the error I showed at the beginning.
Yes, an added_tokens.json file does exist in the exported model folder. Should I delete it?
{
"[UNUSED_TOKEN_141]": 92544,
"[UNUSED_TOKEN_142]": 92545,
"[UNUSED_TOKEN_143]": 92546,
"[UNUSED_TOKEN_144]": 92547,
"[UNUSED_TOKEN_145]": 92548,
"[UNUSED_TOKEN_146]": 92549
}
Yes, you can delete it (or rename the file to something else). These unused tokens don't map to anything in the model (according to the tensor sizes), and this is what makes n_vocab bigger than it should be.
But these tokens also exist in the tokenizer.json and tokenizer_config.json files, which causes an error after deleting the added_tokens.json file:
Traceback (most recent call last):
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3583, in <module>
main()
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3567, in main
model_instance.set_vocab()
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 2179, in set_vocab
if toktypes[token_id] != SentencePieceTokenTypes.UNKNOWN:
IndexError: list index out of range
I wonder if this is a bug in the LLamaFactory export for the InternLM2 model.
I deleted the added_tokens.json file and removed the six added tokens from tokenizer.json and tokenizer_config.json in the model export folder. After that, I re-converted the exported model to GGUF format and imported it into ollama, and ollama ran it successfully. But the model's performance is far worse than it was before LLamaFactory merged the adapter and exported the model.
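For reference, the cleanup was roughly along these lines (a sketch assuming the usual Hugging Face tokenizer file layout; the exact keys may differ in your export):

```python
# Sketch: drop the six unused added tokens (ids >= 92544) from
# tokenizer_config.json and tokenizer.json after deleting added_tokens.json.
import json

model_dir = "/path/to/exported-internlm2"  # placeholder
cutoff = 92544  # first token id with no row in the embedding tensor

# tokenizer_config.json usually lists added tokens under "added_tokens_decoder"
path = f"{model_dir}/tokenizer_config.json"
with open(path) as f:
    cfg = json.load(f)
cfg["added_tokens_decoder"] = {
    tok_id: entry
    for tok_id, entry in cfg.get("added_tokens_decoder", {}).items()
    if int(tok_id) < cutoff
}
with open(path, "w") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)

# tokenizer.json (if present) usually lists them in a top-level "added_tokens" array
path = f"{model_dir}/tokenizer.json"
with open(path) as f:
    tok = json.load(f)
tok["added_tokens"] = [t for t in tok.get("added_tokens", []) if t["id"] < cutoff]
with open(path, "w") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```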
Same problem.
I'm hitting the same issue after I fine-tuned gemma-2-2b-it and tried to convert its LoRA...
@vansinhu Same problem here. I fine-tuned the Internlm2_5-20b-chat model with xtuner, converted it to GGUF with llama.cpp, then ran it with ollama and got the same error. This issue blocks my whole workflow of trying to use InternLM.
I'm getting this when attempting to convert the base Internlm2_5-20b, without fine-tuning.
Removing the tokens works for now. I haven't tested performance, but I don't see any theoretical reason why there would be a gap.
@Sakura4036 I wonder if your performance gap is because you fine-tuned with the additional tokens being used as ChatML tokens, so that removing them results in normal text tokenization, which mismatches your training tokenization.
@euclaise Why does this error occur if you just convert the base model without fine-tuning it?
This issue was closed because it has been inactive for 14 days since being marked as stale.
I'm encountering the same issue. I'm trying to use LLaMA Factory to fine-tune the DeepSeekV2-Lite-Chat model with LoRA, then merge the LoRA weights, convert the model to GGUF, and quantize it using the Q4_K_M method. However, when I try to run it with Ollama, I get the following error:
Error: llama runner process has terminated: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 2048, 576, got 2048, 3072, 1, 1