How to organize a tuning dataset to control output in finetuning

Open zhu-j-faceonlive opened this issue 1 year ago • 14 comments

Hi, thanks for the great model release. Sorry if this is a beginner question. I am testing one document image, and when I ask "What is the document type?" with the base model, it returns "Visa Document US California", following the OCR result. I also want the output to use a date format different from the OCR result.

I organized a tuning dataset with the same image and the same question but a different answer, like "US California Visa Document", and built a small dataset with more images and Q&As. I LoRA-tuned for 3k steps and tried the same question again, but I still get the same answer as the base model.

How can I organize the tuning dataset to control the outputs? And, if possible, how can I verify that the model was actually finetuned (i.e. check the difference from the base model)? Looking forward to your kind assistance. Thanks again for the great model.

zhu-j-faceonlive avatar May 31 '24 14:05 zhu-j-faceonlive

Here are some tricks that can help you achieve the best results. First, customize your hyperparameters by increasing the alpha in LoRA. If your data has already been processed and the base model can already generate reasonable responses, consider lowering the rank. The idea is that a higher LoRA alpha scales the adapter update more strongly against the base model's weights. These are the strategies I follow to achieve optimal results in my context.
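As a rough sketch with peft (the numbers here are just illustrative, not tuned for your task):

    from peft import LoraConfig

    # Illustrative values: alpha is raised relative to the rank so the
    # adapter update (scaled by alpha / r) dominates more strongly.
    lora_config = LoraConfig(
        r=8,                                  # lower rank if the base model already responds well
        lora_alpha=32,                        # higher alpha strengthens the LoRA update
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # adjust to your model's module names
    )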

WAILMAGHRANE avatar Jun 01 '24 14:06 WAILMAGHRANE

It seems a bit strange. Can you share your training parameters, such as batch size and learning rate? Normally, if your training loss is steadily declining, the model's behavior after training should change.

YuzaChongyi avatar Jun 02 '24 16:06 YuzaChongyi

Thanks for the update. What's the minimum size (in QA pairs) of the tuning dataset needed to see an effect?

zhu-j-faceonlive avatar Jun 03 '24 02:06 zhu-j-faceonlive

Thanks for the update. What's the minimum size (in QA pairs) of the tuning dataset needed to see an effect?

Normally, for general tasks, a few thousand QA pairs trained for 3 epochs is enough. But be careful that your batch size isn't too small; it should be at least 64.
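If your GPU can't fit a large batch, gradient accumulation can raise the effective batch size. A minimal sketch using transformers.TrainingArguments (values are illustrative):

    from transformers import TrainingArguments

    # effective batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
    training_args = TrainingArguments(
        output_dir='output',
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,  # 8 * 8 = 64 effective on a single GPU
        learning_rate=1e-5,             # hypothetical value
        num_train_epochs=3,
    )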

YuzaChongyi avatar Jun 03 '24 05:06 YuzaChongyi

Looks like I'm facing the same problem. I used finetune_lora.sh to finetune a task with text and an image as inputs and a short text (a classification label) as output. During training, train/val loss decreased as expected; val loss started to increase at some checkpoint, which could imply overfitting.

Then I loaded the model by calling AutoPeftModelForCausalLM.from_pretrained with the best checkpoint and ran inference with model.chat, passing image, msgs, and tokenizer (the same tokenizer used for finetuning).
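Roughly like this (a sketch; argument names and return values can differ between MiniCPM-V versions):

    from PIL import Image

    # model and tokenizer are loaded as described above
    image = Image.open('test.jpg').convert('RGB')
    msgs = [{'role': 'user', 'content': 'What is the document type?'}]
    answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer, sampling=False)
    print(answer)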

The output does not follow the format in my training dataset. Not sure if I did something wrong in one of the steps.

strawhatboy avatar Jun 04 '24 08:06 strawhatboy

Is the tokenizer set correctly during inference? You should use the original tokenizer or change the chat_template.
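For example (the model id is illustrative, and checkpoint_path is a placeholder for your LoRA output directory):

    from transformers import AutoTokenizer

    # Option 1: reuse the original tokenizer at inference time.
    tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)

    # Option 2: keep the checkpoint's tokenizer but copy the base chat template over.
    # tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, trust_remote_code=True)
    # tokenizer.chat_template = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True).chat_template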

YuzaChongyi avatar Jun 04 '24 09:06 YuzaChongyi

Is the tokenizer set correctly during inference? You should use the original tokenizer or change the chat_template.

I'm not sure... I was finetuning MiniCPM-V-2, so the tokenizer during inference is

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)

Maybe I should load the tokenizer from the checkpoint? Like MiniCPM-V/finetune/output/output_minicpmv2_lora/checkpoint-3800

strawhatboy avatar Jun 04 '24 10:06 strawhatboy

MiniCPM-V-2 does not need a tokenizer change, so your current usage is fine. By the way, why not finetune MiniCPM-V-2 with all parameters? V2 is a relatively small 2B model.

YuzaChongyi avatar Jun 04 '24 10:06 YuzaChongyi

Yeah, the result looks bad. I also tried to LoRA-finetune V2.5; the model seems to output only the most frequent label. Will try finetuning with all params, many thanks!

strawhatboy avatar Jun 05 '24 02:06 strawhatboy

I also encountered this problem. I think it is a problem with the chat_template, but I do not know how to solve it. If there is any progress, would you please let me know?

suanfaxiaohuo avatar Jun 05 '24 03:06 suanfaxiaohuo

It would be great if there were a minimal tuning dataset and its expected output after finetuning. After confirming on that test dataset, we could just swap in our own tuning dataset.
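For example, a minimal one-sample dataset could look like this (field names sketched from the repo's finetune examples; double-check against your version of finetune.py):

    [
      {
        "id": "0",
        "image": "path/to/visa_document.jpg",
        "conversations": [
          { "role": "user", "content": "<image>\nWhat is the document type?" },
          { "role": "assistant", "content": "US California Visa Document" }
        ]
      }
    ]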

zhu-j-faceonlive avatar Jun 05 '24 03:06 zhu-j-faceonlive

Hi, team. I finetuned with 15 <image, Q&A> pairs of one document type (12 pairs as train, 3 as test). After 500 epochs, both the training and eval loss are nearly zero, as follows:

{'loss': 0.0, 'grad_norm': 7.342097887885757e-06, 'learning_rate': 1e-06, 'epoch': 500.0}
{'eval_loss': 7.8905128475526e-07, 'eval_runtime': 2.179, 'eval_samples_per_second': 1.377, 'eval_steps_per_second': 0.459, 'epoch': 500.0}  

But even when I pick an image and ask the same questions from the training set, I still get results different from the training set, like the following:

  • Training Data
{
  "role": "user",
  "content": "What is the address breakdown city?"
},
{
  "role": "assistant",
  "content": "CALWELL"
},
{
  "role": "user",
  "content": "What is the address breakdown street?"
},
{
  "role": "assistant",
  "content": "65 MARTLEY CCT"
},
{
  "role": "user",
  "content": "What is the address?"
},
{
  "role": "assistant",
  "content": "65 MARTLEY CCT, CALWELL, ACT, 2905"
},
  • Model Output
<User>: What is the address breakdown city?
<Assistant>: 65 MARTLEY CCT
<User>: What is the address breakdown street?
<Assistant>: 65 MARTLEY CCT, suburb CALWELL ACT 2905.
<User>: What is the address?
<Assistant>: 65 MARTLEY CCT CALWELL ACT 2905

I used batch_size 8 and model_max_length 1024, and finetuned only the resampler (as follows) due to memory limits.

    if training_args.tune_vision:
        model.base_model.vpm.requires_grad_(False)       # freeze the vision encoder
        model.base_model.resampler.requires_grad_(True)  # train only the resampler

I loaded the model as follows.

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model_path = '/home/paperspace/.../OpenBmb/MiniCPM-V/finetune/output/output_minicpmv2_lora/checkpoint-1000'
# Load the LoRA adapter together with its base model weights
model = AutoPeftModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True).to(dtype=torch.float16)
model = model.to(device=device)  # device defined elsewhere
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.eval()

[TensorBoard screenshot: training curves]

Any help toward getting a correct finetuning result would be appreciated.

EmailScraper avatar Jun 06 '24 00:06 EmailScraper

With the new inference code, which also loads vpm_resampler_embedtokens_weight, instruction following is good now: the LoRA-finetuned MiniCPM-V 2.5 model outputs only the classification labels instead of a long paragraph. The classification accuracy is still poor, though... maybe I need to improve the data quality or use multi-round conversations to include intermediate steps.
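For reference, roughly what my loading looks like now (a sketch: checkpoint_path is a placeholder, and the exact weight file name may differ in your checkpoint directory):

    import torch
    from peft import AutoPeftModelForCausalLM

    model = AutoPeftModelForCausalLM.from_pretrained(checkpoint_path, trust_remote_code=True)
    # Restore the separately saved vision/embedding weights on top of the LoRA model.
    extra = torch.load(f'{checkpoint_path}/vpm_resampler_embedtokens_weight.pt', map_location='cpu')
    model.load_state_dict(extra, strict=False)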

strawhatboy avatar Jun 13 '24 02:06 strawhatboy

Hello @EmailScraper, can you tell me how you called the trained model? I keep getting an error when I call my trained model. Thank you!

Code as follows:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model_path = '/public/home/···/openbmb/MiniCPM-V/finetune/output/output_minicpmv2/checkpoint-10000'
model = AutoPeftModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True).to(dtype=torch.float16)
model = model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.eval()

Error as follows:

Traceback (most recent call last):
  File "/public/home/···/miniconda3/envs/test9/lib/python3.10/site-packages/peft/config.py", line 143, in from_pretrained
    config_file = hf_hub_download(
  File "/public/home/···/miniconda3/envs/test9/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/public/home/···/miniconda3/envs/test9/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/public/home/···/openbmb/MiniCPM-V/finetune/output/output_minicpmv2/checkpoint-10000'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/public/home/···/openbmb/invoke_model.py", line 176, in <module>
    model = AutoPeftModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True, local_files_only=True).to(dtype=torch.float16)
  File "/public/home/···/miniconda3/envs/test9/lib/python3.10/site-packages/peft/auto.py", line 72, in from_pretrained
    peft_config = PeftConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/public/home/···/miniconda3/envs/test9/lib/python3.10/site-packages/peft/config.py", line 147, in from_pretrained
    raise ValueError(f"Can't find '{CONFIG_NAME}' at '{pretrained_model_name_or_path}'")
ValueError: Can't find 'adapter_config.json' at '/public/home/···/openbmb/MiniCPM-V/finetune/output/output_minicpmv2/checkpoint-10000'

limllzu avatar Jun 28 '24 06:06 limllzu

Are you using LoRA finetuning? It looks like you used full-parameter finetuning.
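If it was a full-parameter finetune, the checkpoint is a plain model rather than a PEFT adapter, so it should be loaded without peft. A minimal sketch (assuming the checkpoint directory contains a complete model):

    import torch
    from transformers import AutoModel, AutoTokenizer

    # A full-parameter checkpoint has no adapter_config.json, so use AutoModel
    # instead of AutoPeftModelForCausalLM.
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.float16).to('cuda')
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model.eval()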

LDLINGLINGLING avatar Jul 04 '24 09:07 LDLINGLINGLING

Hello, we have updated the LoRA finetuning and loading procedure here; please try again.

qyc-98 avatar Jul 16 '24 04:07 qyc-98