llama.cpp
MiniCPM 2b model support?
Feature Description
Like Phi is already supported, it would be great to have this Mistral-level 2B model convertible to GGUF.
Motivation
SOTA 2B model, a piece of art; read how they made it:
https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20
Seems like the only unusual thing about this architecture is some modification related to mixing the input/output embeddings of the layers:
From this description I'm not 100% sure what it means, but I suppose it would not be too difficult to implement.
Most impressive 2B model I have ever seen.
Thank you for your interest in MiniCPM. I am one of the authors. In MiniCPM, we implement tie_word_embedding, which means using the same matrix for both the input embedding and the output projection (lm_head). To adapt this from a standard architecture like Llama, you would need to make adjustments such as replacing lm_head.projection with something like input_embedding.projection.
Additionally, our model incorporates $\mu$P (https://arxiv.org/abs/2203.03466), a technique that applies numeric scaling to various model parameters and forward hiddens. Here are some specific modifications related to inference (not training) we've made:
| Modification Name | Specific Operation |
|---|---|
| Embedding Output Scaling | We multiply the output of the embedding by scale_emb=12. |
| Residual Connection Scaling | The increment at each residual connection in every layer is scaled by scale_depth/√(num_layers), which equals 1.4/√(40). |
| lm_head Scaling | The output logits are adjusted to 1/(dim_model/256) = 1/9 of their original value. |
These modifications can also be seen in our Hugging Face transformers code.
Llama.cpp is an exceptional framework for running edge-side LLMs. Please feel free to contact me at [email protected] if you have any questions or need further clarification.
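To make the three scalings in the table concrete, here is a minimal sketch (in PyTorch-style pseudocode) of where they would sit in a Llama-like forward pass. The module names (`embed_tokens`, `layers`, `lm_head`) and the simplified attention/MLP calls are assumptions for illustration, not the actual transformers or llama.cpp implementation; dim_model = 2304 is inferred from the 1/9 logit scaling above.

```python
import math

# Constants taken from the table above (MiniCPM-2B).
SCALE_EMB = 12.0      # embedding output scaling
SCALE_DEPTH = 1.4     # numerator of the residual scaling
NUM_LAYERS = 40
DIM_MODEL = 2304      # hidden size; dim_model / 256 = 9, hence the 1/9 logit scale

def minicpm_like_forward(model, input_ids):
    """Sketch of a Llama-style forward pass with MiniCPM's muP scalings applied."""
    # 1. Embedding output scaling: multiply the embedding output by scale_emb.
    h = model.embed_tokens(input_ids) * SCALE_EMB

    residual_scale = SCALE_DEPTH / math.sqrt(NUM_LAYERS)
    for layer in model.layers:
        # 2. Residual connection scaling: each residual increment is scaled by
        #    scale_depth / sqrt(num_layers) before being added back.
        #    (Rotary embeddings, attention masks, etc. are omitted for brevity.)
        h = h + layer.self_attn(layer.input_layernorm(h)) * residual_scale
        h = h + layer.mlp(layer.post_attention_layernorm(h)) * residual_scale

    h = model.norm(h)
    # 3. lm_head scaling: divide the logits by dim_model / 256.
    #    With tie_word_embedding, lm_head shares its weight matrix with embed_tokens.
    logits = model.lm_head(h) / (DIM_MODEL / 256)
    return logits
```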
There is a branch of llama.cpp that supports MiniCPM in your official repository. Could you push it back here?
@ShengdingHu
The branch here appears to have an issue. I am using the openbmb/MiniCPM-2B-dpo-fp16 model; when converting it with python3 convert.py, the converted model lacks an output.weight tensor, resulting in an error. Additionally, the convert-hf-to-gguf.py script has not yet implemented support for MiniCPM, leading to the error: "Architecture 'MiniCPMForCausalLM' not supported!"
Some changes were applied to another fork of llama.cpp in this project:
https://github.com/zkh2016/llmfarm_core.swift/commit/c7de12db67a12b3c22367721d70f1c3228830116
Thanks, problem solved! But the output makes no sense
Is the prompt template correct? Check this file:
https://github.com/OpenBMB/LLMFarm-MiniCPM/blob/main/LLMFarm/model_setting_templates/llama%20chat%202%207B%20iphone%2012.json
Maybe an issue with the tokenizer?
The information you summarized from the paper is very helpful. Thank you for your work.
Sure, we will look into it today!
Thanks, the output might not be due to the template. Could you point me to the code that's generating the nonsensical output? I'm a bit lost in all the information. Is it directly produced by our zkh2016/llmfarm_core.swift@c7de12d?
Not really, I actually referenced code from this repo and integrated it with zkh2016/llmfarm_core.swift@c7de12d.
Good news: we have converted the original checkpoints into Llama format. Specifically,
- we absorb the $\mu$P scaling factors into the model checkpoints;
- we untie the heads and absorb the scaling factors into the embedding and lm_head (although this might take more memory).
This produces a checkpoint that can be loaded immediately by Llama code (a rough sketch of the absorption is given after the usage example below). The Hugging Face repo is openbmb/MiniCPM-2B-dpo-bf16-llama-format:
import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format"
tokenizer = LlamaTokenizerFast.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)
prompt="Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`"
input_ids = tokenizer.encode("<用户>{}<AI>".format(prompt), return_tensors='pt', add_special_tokens=True).cuda()
responds = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024)
responds = tokenizer.decode(responds[0], skip_special_tokens=True)
print(responds)
example output:
This might be a lot easier to use in llama.cpp.
Could you help us add it to the supported models?
As for the MiniCPM code without converting to Llama format, we think the problem might be that lm_head is not loaded from the input embeddings. Could you help us check it?
Thanks a lot!
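For reference, a minimal sketch of what absorbing the scalings into a Llama-format checkpoint could look like, assuming the usual transformers state_dict key names (`model.embed_tokens.weight`, `model.layers.{i}.self_attn.o_proj.weight`, etc.). The actual script used to produce openbmb/MiniCPM-2B-dpo-bf16-llama-format is not shown in this thread, so treat this only as an illustration of the two bullet points above:

```python
import math

def absorb_mup_scalings(state_dict, num_layers=40, dim_model=2304,
                        scale_emb=12.0, scale_depth=1.4):
    """Fold MiniCPM's muP scalings into a Llama-style state_dict (expects torch tensors)."""
    sd = dict(state_dict)

    # Untie the head: start from a copy of the (unscaled) input embedding matrix.
    embed = sd["model.embed_tokens.weight"]
    lm_head = sd.get("lm_head.weight", embed).clone()

    # Embedding output scaling (h = emb(x) * scale_emb) folds into the embedding weight.
    sd["model.embed_tokens.weight"] = embed * scale_emb

    # Residual scaling (increment * scale_depth / sqrt(num_layers)) folds into the
    # last linear layer of each attention and MLP block (Llama-style blocks have no biases).
    res = scale_depth / math.sqrt(num_layers)
    for i in range(num_layers):
        sd[f"model.layers.{i}.self_attn.o_proj.weight"] = sd[f"model.layers.{i}.self_attn.o_proj.weight"] * res
        sd[f"model.layers.{i}.mlp.down_proj.weight"] = sd[f"model.layers.{i}.mlp.down_proj.weight"] * res

    # Logit scaling (logits / (dim_model / 256)) folds into the now-untied lm_head.
    sd["lm_head.weight"] = lm_head / (dim_model / 256)
    return sd
```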
There is a PR to support MiniCPM 2B. Please help check whether it works correctly:
https://github.com/ggerganov/llama.cpp/pull/5346
It seems that the model behaves strangely in the PR; I am checking it.
Please check here: I've fixed the bug; feel free to do more testing.
@ShengdingHu Could you please help test the latest release of llama.cpp, which includes support for converting and running inference with the original MiniCPM model? If everything works well, there are at least three benefits compared to providing a "llama-lized" MiniCPM model on Hugging Face:
- It simplifies your model publishing work, as there is no need to convert various kinds of models.
- It reduces confusion for users who encounter different types of models.
- It saves memory for both model storage and inference.
Thank you and I look forward to hearing your feedback.
That's definitely better than a Llama-format conversion, thanks very much. I am testing the latest release.
Still getting:
llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found
./main --version
version: 2252 (525213d2)
I've just tested it, and it works. Please make sure you have converted the model using the latest version of convert-hf-to-gguf.py.
What's the status of this? They just released three new models, and it's as if they were reading my mind by creating a 3B model with a huge context (could be great for summarization).
https://www.reddit.com/r/LocalLLaMA/comments/1c3badu/three_new_minicpm_models_moe_vision_128k/
I am still getting this on Apple Silicon:
$ make clean ; git pull origin ; make -j $(nproc)
$ conda activate llama
$ python3 -m pip install -U -r requirements.txt
$ python3 convert-hf-to-gguf.py models/openbmb/MiniCPM-2B-128k/
Loading model: MiniCPM-2B-128k
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type unk to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
gguf: Setting chat_template to {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Exporting model to 'models/openbmb/MiniCPM-2B-128k/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model.bin'
/Users/gardner/miniconda3/envs/llama/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
output_norm.weight, n_dims = 1, torch.bfloat16 --> float32
Can not map tensor 'lm_head.weight'
It creates a zero-length file at models/openbmb/MiniCPM-2B-128k/ggml-model-f16.gguf
$ ls -lah models/openbmb/MiniCPM-2B-128k
total 11779952
drwxr-xr-x@ 14 gardner staff 448B 14 Apr 23:05 .
drwxr-xr-x@ 3 gardner staff 96B 14 Apr 22:59 ..
-rw-r--r--@ 1 gardner staff 1.5K 14 Apr 22:57 .gitattributes
-rw-r--r--@ 1 gardner staff 7.2K 14 Apr 22:57 README.md
-rw-r--r--@ 1 gardner staff 168B 14 Apr 22:57 added_tokens.json
-rw-r--r--@ 1 gardner staff 1.1K 14 Apr 22:57 config.json
-rw-r--r--@ 1 gardner staff 9.7K 14 Apr 22:57 configuration_minicpm.py
-rw-r--r--@ 1 gardner staff 0B 14 Apr 23:05 ggml-model-f16.gguf
-rw-r--r--@ 1 gardner staff 66K 14 Apr 22:57 modeling_minicpm.py
-rw-r--r--@ 1 gardner staff 5.6G 14 Apr 22:59 pytorch_model.bin
-rw-r--r--@ 1 gardner staff 574B 14 Apr 22:59 special_tokens_map.json
-rw-r--r--@ 1 gardner staff 5.9M 14 Apr 22:59 tokenizer.json
-rw-r--r--@ 1 gardner staff 1.9M 14 Apr 22:59 tokenizer.model
-rw-r--r--@ 1 gardner staff 2.6K 14 Apr 22:59 tokenizer_config.json
@gardner @flatsiedatsie The latest long-context model couldn't be supported because it "removed tie_embedding and expanded the vocabulary to 127660". This could be solved by adding some lines to process MODEL_TENSOR.OUTPUT. However, it seems quite difficult to distinguish the new model from the older ones. @ShengdingHu any suggestions?
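For what it's worth, one possible check is simply whether the checkpoint ships an explicit lm_head.weight. A minimal sketch of that idea is below; the `writer.add_tensor` helper and the state_dict handling are placeholders rather than the actual convert-hf-to-gguf.py API, and mapping MODEL_TENSOR.OUTPUT to "output.weight" follows the tensor name mentioned earlier in this thread:

```python
def export_output_head(state_dict, writer):
    """Sketch: choose the source of output.weight for tied vs. untied MiniCPM checkpoints."""
    if "lm_head.weight" in state_dict:
        # Newer checkpoints (e.g. the 128k model) untie the embeddings and ship an
        # explicit lm_head.weight; map it to MODEL_TENSOR.OUTPUT ("output.weight").
        writer.add_tensor("output.weight", state_dict["lm_head.weight"])
    else:
        # The original 2B checkpoints tie the head to the input embedding, so the
        # output projection can simply reuse the token embedding matrix.
        writer.add_tensor("output.weight", state_dict["model.embed_tokens.weight"])
```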
Hey, you can give ChatLLM.cpp a try; the 2B, 1B, and MoE models are all supported. 😊
@fodl thanks for the suggestion, but unfortunately I'm relying on llama-cpp-wasm / wllama to run these models 100% in the browser. ChatLLM.cpp does not seem to support that use case at the moment?