CogVLM: abnormal inference results from the merged-LoRA model after fp16 training


System Info

Versions and hardware were set up according to the instructions.

Who can help?

@1049451037

Information

[X] The official example scripts
[ ] My own modified scripts and tasks

Reproduction

Train LoRA in fp16 by passing --fp16.
Running inference with finetune_cogvlm_demo.py on the unmerged LoRA model gives correct results.
Running inference on the merged LoRA model gives abnormal results.

Expected behavior

I suspect there is a bug in the merge process for models trained in fp16. Could you help locate the problem?

Apr 16 '24 07:04 GondorFu

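Could you share a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data-type problem.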

Apr 24 '24 09:04 elesun2018

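"Could you share a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data-type problem."

There is no error. The inference output itself is wrong: without merging the results are correct, but after merging every inference result is [][][][][][]...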

Apr 26 '24 04:04 GondorFu

training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), handle_metrics_function=handle_metrics_function, collate_fn=data_collator, forward_step_eval=forward_step_eval)

if args.use_lora:
    model.get_mixin("lora").merge_lora()
    model.get_mixin("eva").vit_model.get_mixin("lora").merge_lora()
    args.use_lora = False

training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), handle_metrics_function=handle_metrics_function, collate_fn=data_collator, forward_step_eval=forward_step_eval)

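Both versions run and produce output, but the first one (without the merge) gives correct results while the second one (with the merge) gives wrong results. What could be the reason?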

Apr 30 '24 07:04 GondorFu
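For readers unfamiliar with what the merge step does, here is a minimal pure-Python sketch of the standard LoRA merge formula, W + scale * (B @ A), on a toy 2x2 layer. It is not the SAT `merge_lora()` implementation, and the helper names (`to_fp16`, `linear`, `merge_lora`) and all numbers are made up for illustration. It shows that the merged and unmerged paths agree in full precision, and that rounding the merged weights to fp16 can silently drop a small LoRA update. This is only one possible way a merge can diverge; the thread does not confirm it as the cause of the `[][][]...` outputs.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def linear(x, w):
    """Compute y = W x for a vector x and a matrix W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def merge_lora(w, lora_a, lora_b, scale):
    """Return W + scale * (B @ A), with A of shape (r, in) and B of shape (out, r)."""
    delta = [[sum(b * a for b, a in zip(b_row, a_col)) for a_col in zip(*lora_a)]
             for b_row in lora_b]
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy values (hypothetical): a 2x2 base weight and a rank-1 LoRA update.
w = [[1.0, 0.5], [-0.25, 2.0]]
lora_a = [[0.01, 0.02]]      # shape (r=1, in=2)
lora_b = [[0.01], [0.02]]    # shape (out=2, r=1)
scale = 1.0
x = [1.0, 1.0]

# Unmerged path: base output plus the low-rank correction, computed separately.
base_out = linear(x, w)
lora_out = linear(linear(x, lora_a), lora_b)
unmerged_out = [b + scale * l for b, l in zip(base_out, lora_out)]

# Merged path in full precision: should match the unmerged path.
merged = merge_lora(w, lora_a, lora_b, scale)
merged_out = linear(x, merged)

# Merged path with the merged weights rounded to fp16.
merged_fp16 = [[to_fp16(v) for v in row] for row in merged]
fp16_out = linear(x, merged_fp16)

print("unmerged:", unmerged_out)
print("merged (full precision):", merged_out)
print("merged (fp16 weights):", fp16_out)
```

In this toy case the update to the first row is smaller than half the fp16 spacing around the base weights, so rounding the merged weights discards it. Comparing the merged and unmerged outputs on a single input in this way is a cheap first check before looking for a bug elsewhere in the merge script.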

"Could you share a screenshot of the error? fp16 and bf16 can be converted to each other at any time, so it should not be a data-type problem."

AFAIK, it can be a problem, because bf16 has a wider range but lower precision than fp16.

Jul 01 '24 01:07 JBurtn
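The range versus precision trade-off mentioned above can be checked with the standard library alone. The sketch below uses `struct`'s half-precision format for fp16 and emulates bf16 by rounding the float32 bit pattern to its top 16 bits; the helper names are illustrative and not part of any library, and the bf16 emulation assumes finite inputs.

```python
import struct


def round_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision (fp16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_fp16(x: float) -> bool:
    """Return True if x is within the finite fp16 range."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def round_bf16(x: float) -> float:
    """Emulate bfloat16: keep the top 16 bits of the float32 encoding,
    rounding to nearest even. Assumes a finite input."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even bias
    bits &= 0xFFFF0000                    # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


big = 70000.0            # above the largest finite fp16 value (65504)
fine = 1.001953125       # exactly 1 + 2**-9

print("fp16 can hold 70000:", fits_fp16(big))
print("bf16(70000) =", round_bf16(big))
print("fp16(1 + 2**-9) =", round_fp16(fine))
print("bf16(1 + 2**-9) =", round_bf16(fine))
```

The large value overflows fp16 but survives in bf16 with a coarse rounding, while the small offset from 1.0 is kept by fp16 and lost by bf16. A conversion between the two formats is therefore not always lossless in either direction.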

Was this ever solved? Also running into this error when trying to just reproduce the CogAgent finetuning results from the official example scripts.

During fine-tuning (finetune_cogagent_demo.py), the predictions are correct, but the merged model has wrong predictions that are completely off during evaluation (merge_model.py and evaluate_cogagent_demo.py).

Aug 03 '24 00:08 KevinH48264
