CogVLM

Inference results are abnormal after merging LoRA into a model trained with fp16

Open GondorFu opened this issue 1 year ago • 5 comments

System Info

Versions and hardware were set up according to the instructions.

Who can help?

@1049451037

Information

  • [X] The official example scripts
  • [ ] My own modified scripts and tasks

Reproduction

  1. Train LoRA in fp16 by passing --fp16.
  2. Running inference through finetune_cogvlm_demo.py with the un-merged LoRA model gives correct results.
  3. Running inference with the merged LoRA model gives abnormal results.

Expected behavior

I suspect there is a bug in the merge step for models trained in fp16. Could you help pinpoint the problem?

GondorFu, Apr 16 '24 07:04

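Could you post a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it shouldn't be a data type problem.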

elesun2018, Apr 24 '24 09:04

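Could you post a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it shouldn't be a data type problem.

There is no error; the inference output itself is wrong. Without merging the results are correct, but after merging every inference result is [][][][][][]...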

GondorFu, Apr 26 '24 04:04

training_main(
    args,
    model_cls=model,
    forward_step_function=forward_step,
    create_dataset_function=partial(create_dataset_function, image_processor, text_processor),
    handle_metrics_function=handle_metrics_function,
    collate_fn=data_collator,
    forward_step_eval=forward_step_eval,
)

if args.use_lora:
    model.get_mixin("lora").merge_lora()
    model.get_mixin("eva").vit_model.get_mixin("lora").merge_lora()
    args.use_lora = False

training_main(
    args,
    model_cls=model,
    forward_step_function=forward_step,
    create_dataset_function=partial(create_dataset_function, image_processor, text_processor),
    handle_metrics_function=handle_metrics_function,
    collate_fn=data_collator,
    forward_step_eval=forward_step_eval,
)

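Both variants run and produce output, but the first one (above, without merging) gives correct results while the second one (below, after merging) gives wrong results. What could be the reason?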
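For readers following along: conceptually, merging LoRA folds the low-rank update into the base weight, W' = W + s·A·B, so the merged and un-merged forward passes should agree up to rounding. The pure-Python sketch below is not the library's merge_lora implementation, and it is not a confirmed diagnosis of this issue. The helper names (matmul, add, to_fp16) and the toy matrices are hypothetical, chosen only to show one way a merge can change results when the merged weight is stored in fp16: an update that is small relative to the base weight can be rounded away.

```python
import math
import struct

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Toy shapes (hypothetical): x is 1x2, W is 2x2, A is 2x1, B is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.01], [0.02]]
B = [[0.01, 0.0]]
scale = 1.0
x = [[1.0, 0.0]]

# Un-merged path: y = x W + scale * (x A) B
y_unmerged = add(matmul(x, W), matmul(matmul(x, A), B), scale)

# Merged path in full precision: W' = W + scale * A B
W_merged = add(W, matmul(A, B), scale)
y_merged = matmul(x, W_merged)

# Merged path with the merged weight stored in fp16.
W_merged_fp16 = [[to_fp16(v) for v in row] for row in W_merged]
y_merged_fp16 = matmul(x, W_merged_fp16)

print("un-merged:       ", y_unmerged)
print("merged (full):   ", y_merged)
print("merged (fp16 W'):", y_merged_fp16)
```

In this toy case the update of about 1e-4 on a weight of 1.0 is below the fp16 spacing near 1.0, so the fp16 merged weight loses it, while the un-merged path keeps it. Comparing the two outputs on a single input in this way, first in float32 and then in the training dtype, is one possible way to narrow down whether the merge itself or the precision is at fault.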

GondorFu, Apr 30 '24 07:04

Could you post a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it shouldn't be a data type problem.

AFAIK, it can be a problem, because bf16 has a wider range but lower precision than fp16.
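The range versus precision trade-off can be seen with the standard library alone. The sketch below is only an illustration: to_fp16 uses the struct module's half-precision format, and to_bf16 is a hypothetical helper that emulates bfloat16 by keeping the top 16 bits of the float32 encoding with round-to-nearest-even (it does not handle NaN or infinity specially).

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; raises OverflowError if x is out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by rounding the float32 encoding to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, then clear the low 16 bits.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Precision: fp16 has 10 mantissa bits, bf16 has 7.
print(to_fp16(1.001))   # 1.0009765625
print(to_bf16(1.001))   # 1.0

# Range: fp16 tops out at 65504, bf16 shares float32's exponent range.
print(to_bf16(70000.0))  # 70144.0
try:
    to_fp16(70000.0)
    fp16_overflows = False
except OverflowError:
    fp16_overflows = True
print("fp16 overflows at 70000:", fp16_overflows)
```

So a value that is fine in bf16 can overflow in fp16, and a small difference that fp16 preserves can vanish in bf16. Converting between the two is therefore not always lossless.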

JBurtn, Jul 01 '24 01:07

Was this ever solved? Also running into this error when trying to just reproduce the CogAgent finetuning results from the official example scripts.

During fine-tuning (finetune_cogagent_demo.py), the predictions are correct, but the merged model has wrong predictions that are completely off during evaluation (merge_model.py and evaluate_cogagent_demo.py).

KevinH48264, Aug 03 '24 00:08