MiniCPM 2b model support?

Open KnutJaegersberg opened this issue 1 year ago • 26 comments

Feature Description

Like Phi, which is already supported, it would be great to have this Mistral-level 2B model convertible to GGUF.

Motivation

A SOTA 2B model and a piece of art; read how they made it:

https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20

KnutJaegersberg avatar Feb 02 '24 08:02 KnutJaegersberg

Seems like the only unusual thing about this architecture is some modification related to mixing the input/output embeddings of the layers:

[screenshot: the relevant excerpt from the MiniCPM write-up]

From this description, I'm not 100% sure what it means, but I suppose it would not be too difficult to implement

ggerganov avatar Feb 02 '24 08:02 ggerganov

Most impressive 2B model I've ever seen.

raymond-infinitecode avatar Feb 03 '24 11:02 raymond-infinitecode

Thank you for your interest in MiniCPM. I am one of the authors. In MiniCPM, we implement tie_word_embedding, which means the same matrix is used for both the input embedding and the output projection (lm_head). To adapt this from a standard architecture like Llama, you would need to make adjustments such as replacing lm_head.projection with something like input_embedding.projection.

Additionally, our model incorporates $\mu$P (https://arxiv.org/abs/2203.03466), a technique that applies numeric scaling to various model parameters and forward hidden states. Here are the specific inference-time (not training-time) modifications we've made:

| Modification Name | Specific Operation |
| --- | --- |
| Embedding Output Scaling | We multiply the output of the embedding by scale_emb = 12. |
| Residual Connection Scaling | The increment at each residual connection in every layer is scaled by scale_depth/√(num_layers), which equals 1.4/√40. |
| lm_head Scaling | The output logits are scaled to 1/(dim_model/256) = 1/9 of their original value. |

These modifications can also be seen from our huggingface transformers code.

Llama.cpp is an exceptional framework for running edge-side LLMs. Please feel free to contact me at [email protected] if you have any questions or need further clarification.
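
To make this concrete, here is a minimal Python sketch of where the three scalings above enter a Llama-style forward pass (a sketch only: dim_model = 2304 and num_layers = 40 are inferred from the 1/9 and 1.4/√40 values, and each transformer block is treated as a callable returning its residual increment):

import math
import torch.nn as nn

# Constants from the table above; dim_model = 2304 and num_layers = 40 are
# inferred from 1/(dim_model/256) = 1/9 and 1.4/sqrt(40).
SCALE_EMB = 12.0
SCALE_DEPTH = 1.4
NUM_LAYERS = 40
DIM_MODEL = 2304

def minicpm_style_forward(embed: nn.Embedding, blocks, final_norm, lm_head, input_ids):
    # 1) Embedding output scaling.
    h = embed(input_ids) * SCALE_EMB
    # 2) Residual connection scaling: each block's increment (attention + MLP,
    #    shown schematically) is scaled by scale_depth / sqrt(num_layers).
    residual_scale = SCALE_DEPTH / math.sqrt(NUM_LAYERS)
    for block in blocks:
        h = h + residual_scale * block(h)
    h = final_norm(h)
    # 3) lm_head scaling: with tied embeddings, lm_head shares the embedding
    #    matrix; the logits are divided by dim_model / 256 (= 9).
    return lm_head(h) / (DIM_MODEL / 256)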

ShengdingHu avatar Feb 04 '24 14:02 ShengdingHu

There is a branch of llama.cpp to support MiniCPM in your official repository. Could you push it back here?

@ShengdingHu

sweetcard avatar Feb 05 '24 04:02 sweetcard

The branch there appears to have an issue. I am using the openbmb/MiniCPM-2B-dpo-fp16 model; when converting with python3 convert.py, the resulting model lacks an output.weight tensor (see screenshot), which results in an error. Additionally, the convert-hf-to-gguf.py script has not yet implemented support for MiniCPM, leading to the error: "Architecture 'MiniCPMForCausalLM' not supported!"

lzs0603 avatar Feb 05 '24 09:02 lzs0603

Some changes were applied to another llama.cpp fork in this project:

https://github.com/zkh2016/llmfarm_core.swift/commit/c7de12db67a12b3c22367721d70f1c3228830116

sweetcard avatar Feb 05 '24 09:02 sweetcard

Thanks, problem solved! But the output makes no sense (see screenshot).

lzs0603 avatar Feb 05 '24 09:02 lzs0603

Is the prompt template correct? Check this file:

https://github.com/OpenBMB/LLMFarm-MiniCPM/blob/main/LLMFarm/model_setting_templates/llama%20chat%202%207B%20iphone%2012.json
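
For reference, MiniCPM's chat format (as used in the authors' example later in this thread) wraps each user turn as <用户>...<AI>; a quick illustrative check of the prompt string:

prompt = "hello"
formatted = "<用户>{}<AI>".format(prompt)  # chat wrapping used in the authors' example
print(formatted)  # prints: <用户>hello<AI>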

sweetcard avatar Feb 05 '24 09:02 sweetcard

Maybe an issue with the tokenizer?

lin-calvin avatar Feb 05 '24 10:02 lin-calvin

The information you summarized from the paper is very helpful. Thank you for your work.

runfuture avatar Feb 05 '24 14:02 runfuture

There is a branch of llama.cpp to support MiniCPM in your official repository. Could you push it back here?

@ShengdingHu

Sure, we will look into it today!

ShengdingHu avatar Feb 05 '24 16:02 ShengdingHu

Is the prompt template correct? Check this file:

https://github.com/OpenBMB/LLMFarm-MiniCPM/blob/main/LLMFarm/model_setting_templates/llama%20chat%202%207B%20iphone%2012.json

Thanks, the output might not be due to the template. Could you point me to the code that's generating the nonsensical output? I'm a bit lost in all the information. Is it directly produced by our zkh2016/llmfarm_core.swift@c7de12d?

ShengdingHu avatar Feb 05 '24 16:02 ShengdingHu

Not really; I actually referenced this repo and integrated it with zkh2016/llmfarm_core.swift@c7de12d.

lzs0603 avatar Feb 06 '24 05:02 lzs0603

Good news: we have converted the original checkpoints into Llama format. Specifically,

  1. we absorb the $\mu$P scaling factors into the model checkpoints.
  2. we untie the heads and absorb the scaling factors into the embedding and lm_head (although this might take more memory); see the sketch below.
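
Schematically, these two steps look roughly like the following (a hypothetical sketch, not the actual conversion script; it assumes a Llama-style parameter layout and the constants from the earlier comment):

import math

# Constants as in the earlier comment; parameter names assume a Llama-style layout.
SCALE_EMB = 12.0
SCALE_DEPTH = 1.4
NUM_LAYERS = 40
DIM_MODEL = 2304

def absorb_mup_scales(state_dict: dict) -> dict:
    out = dict(state_dict)
    emb = state_dict["model.embed_tokens.weight"]
    # Fold the embedding output scaling into the embedding weights.
    out["model.embed_tokens.weight"] = emb * SCALE_EMB
    # Untie the head: lm_head gets its own copy of the (unscaled) embedding,
    # with the 1/(dim_model/256) logit scaling folded in.
    out["lm_head.weight"] = emb / (DIM_MODEL / 256)
    # Fold the residual scaling into each block's output projections.
    c = SCALE_DEPTH / math.sqrt(NUM_LAYERS)
    for i in range(NUM_LAYERS):
        for name in (f"model.layers.{i}.self_attn.o_proj.weight",
                     f"model.layers.{i}.mlp.down_proj.weight"):
            out[name] = state_dict[name] * c
    return out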

This produces a checkpoint that can be loaded directly by Llama code. The Hugging Face repo is openbmb/MiniCPM-2B-dpo-bf16-llama-format:

import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format"
tokenizer = LlamaTokenizerFast.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

prompt="Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`"
input_ids = tokenizer.encode("<用户>{}<AI>".format(prompt), return_tensors='pt', add_special_tokens=True).cuda()
responds = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024)
responds = tokenizer.decode(responds[0], skip_special_tokens=True)
print(responds)

example output:

[screenshot of the example output]

This might be a lot easier to use in llama.cpp.

Could you help us add it to the supported models?

As for the MiniCPM code without converting to Llama format, we think the issue might be that lm_head is not loaded from the input embeddings. Could you help us check it?

Thanks a lot!

ShengdingHu avatar Feb 07 '24 07:02 ShengdingHu

There is a PR to support MiniCPM 2B. Please help check whether it works correctly.

https://github.com/ggerganov/llama.cpp/pull/5346

sweetcard avatar Feb 07 '24 07:02 sweetcard

It seems that in the PR the model behaves strangely; I am checking it.

ShengdingHu avatar Feb 07 '24 08:02 ShengdingHu

Please check here: I've fixed the bug; feel free to do more testing.

runfuture avatar Feb 07 '24 15:02 runfuture

@ShengdingHu Could you please help test the latest release of llama.cpp, which includes support for converting and inferring with the original MiniCPM model? If everything works well, there are at least three benefits compared to providing a "llama-lized" MiniCPM model on Hugging Face:

  1. It simplifies your model publishing work, as there is no need to convert various kinds of models.
  2. It reduces confusion for users who encounter different types of models.
  3. It saves memory for both model storage and inference.

Thank you and I look forward to hearing your feedback.

runfuture avatar Feb 08 '24 15:02 runfuture

That's definitely better than a Llama-format conversion, thanks very much. I am testing the latest release.

ShengdingHu avatar Feb 09 '24 02:02 ShengdingHu

Still getting

llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found
./main --version
version: 2252 (525213d2)

gardner avatar Feb 24 '24 13:02 gardner

I've just tested it, and it works. Please make sure you have converted the model using the latest version of convert-hf-to-gguf.py.

runfuture avatar Feb 25 '24 06:02 runfuture

What's the status of this? They just released three new models, and it's as if they were reading my mind by creating a 3B model with a huge context (which could be great for summarization).

https://www.reddit.com/r/LocalLLaMA/comments/1c3badu/three_new_minicpm_models_moe_vision_128k/

flatsiedatsie avatar Apr 14 '24 08:04 flatsiedatsie

I am still getting this on Apple Silicon:

$ make clean ; git pull origin ; make -j $(nproc)
$ conda activate llama
$ python3 -m pip install -U -r requirements.txt

$ python3 convert-hf-to-gguf.py models/openbmb/MiniCPM-2B-128k/
Loading model: MiniCPM-2B-128k
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type unk to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
gguf: Setting chat_template to {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Exporting model to 'models/openbmb/MiniCPM-2B-128k/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model.bin'
/Users/gardner/miniconda3/envs/llama/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
output_norm.weight, n_dims = 1, torch.bfloat16 --> float32
Can not map tensor 'lm_head.weight'

It creates a zero-length file at models/openbmb/MiniCPM-2B-128k/ggml-model-f16.gguf

$ ls -lah models/openbmb/MiniCPM-2B-128k
total 11779952
drwxr-xr-x@ 14 gardner  staff   448B 14 Apr 23:05 .
drwxr-xr-x@  3 gardner  staff    96B 14 Apr 22:59 ..
-rw-r--r--@  1 gardner  staff   1.5K 14 Apr 22:57 .gitattributes
-rw-r--r--@  1 gardner  staff   7.2K 14 Apr 22:57 README.md
-rw-r--r--@  1 gardner  staff   168B 14 Apr 22:57 added_tokens.json
-rw-r--r--@  1 gardner  staff   1.1K 14 Apr 22:57 config.json
-rw-r--r--@  1 gardner  staff   9.7K 14 Apr 22:57 configuration_minicpm.py
-rw-r--r--@  1 gardner  staff     0B 14 Apr 23:05 ggml-model-f16.gguf
-rw-r--r--@  1 gardner  staff    66K 14 Apr 22:57 modeling_minicpm.py
-rw-r--r--@  1 gardner  staff   5.6G 14 Apr 22:59 pytorch_model.bin
-rw-r--r--@  1 gardner  staff   574B 14 Apr 22:59 special_tokens_map.json
-rw-r--r--@  1 gardner  staff   5.9M 14 Apr 22:59 tokenizer.json
-rw-r--r--@  1 gardner  staff   1.9M 14 Apr 22:59 tokenizer.model
-rw-r--r--@  1 gardner  staff   2.6K 14 Apr 22:59 tokenizer_config.json

gardner avatar Apr 14 '24 11:04 gardner

@gardner @flatsiedatsie The latest long-context model isn't supported yet because it "removed tie_embedding and expanded the vocabulary to 127660". This could be solved by adding some lines to handle MODEL_TENSOR.OUTPUT. However, it seems quite difficult to distinguish the new model from the older ones. @ShengdingHu, any suggestions?
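
One hypothetical way to handle this (a sketch of the idea only, not the actual convert-hf-to-gguf.py code) is to detect the untied head by the presence of lm_head.weight in the checkpoint:

def minicpm_output_tensor(state_dict: dict) -> dict:
    # Newer long-context checkpoints (e.g. MiniCPM-2B-128k) untie the embeddings
    # and ship an explicit lm_head.weight; emit it as GGUF's 'output.weight'.
    if "lm_head.weight" in state_dict:
        return {"output.weight": state_dict["lm_head.weight"]}
    # Older MiniCPM-2B checkpoints tie the head to the token embedding, so there
    # is no separate output tensor to write.
    return {}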

runfuture avatar Apr 14 '24 15:04 runfuture

Hey, you can give ChatLLM.cpp a try; the 2B, 1B, and MoE models are all supported. 😊

foldl avatar Apr 15 '24 08:04 foldl

@foldl thanks for the suggestion, but unfortunately I'm relying on llama-cpp-wasm / wllama to run these models 100% in the browser. ChatLLM.cpp does not seem to support that use case at the moment?

flatsiedatsie avatar Apr 23 '24 08:04 flatsiedatsie