
Finetuning example loss does not decrease over time

Open · hvico opened this issue · 8 comments

System Info / 系統信息

CUDA 12.1, Torch 2.3

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • [ ] The official example scripts / 官方的示例脚本
  • [ ] My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

I've been trying to run the finetuning example for CogVLM2 (on a dual-A100 setup with DeepSpeed). The loss fluctuates wildly around 2.0 and doesn't seem to converge or decrease over time. At first I tried with my own dataset, but since that didn't work I also tested the code with the provided labels_en, both the multi-conversation and the single-conversation versions, and it behaves mostly the same (the en version fluctuates around maybe 1.8, up and down).

I tried decreasing and increasing the lr, with no luck. Any suggestions?
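For reference, in a DeepSpeed-based run the learning rate is typically set in the optimizer and scheduler blocks of the DeepSpeed config. A minimal sketch, expressed as a Python dict rather than the repo's actual ds_config (all values here are placeholders, not the example's defaults):

```python
# Sketch of the parts of a DeepSpeed config that control the learning rate.
# Field names follow the standard DeepSpeed schema; the values are placeholders,
# not the CogVLM2 finetuning example's defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4, "weight_decay": 0.05},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": 0, "warmup_max_lr": 1e-4, "warmup_num_steps": 100},
    },
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}
```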

BTW, thanks for releasing this great model!

Expected behavior / 期待表现

Loss decreases.

(screenshot: loss curve from the finetuning run)

hvico · May 23, 2024

Just in case, I also tried the same finetuning on 8 x A100 to mimic the docs; the loss doesn't seem to decrease there either.

This is my TensorBoard for the 8-GPU finetuning right now, with all params at their defaults:

(screenshot: TensorBoard loss curve for the 8-GPU run)

hvico · May 23, 2024

Same here, the loss does not decrease for me either. My target modules: ["gate_proj","up_proj","down_proj","vision_expert_query_key_value","vision_expert_dense","language_expert_query_key_value","language_expert_dense"]
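For context, those look like LoRA target modules. A minimal sketch of how such a list might be wired up with Hugging Face PEFT, assuming a PEFT-based LoRA setup; the rank/alpha/dropout values are placeholders, not the repo's settings:

```python
# Minimal LoRA setup sketch using Hugging Face PEFT. The target_modules list is
# taken from the comment above; r, lora_alpha and lora_dropout are placeholders.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "gate_proj", "up_proj", "down_proj",
        "vision_expert_query_key_value", "vision_expert_dense",
        "language_expert_query_key_value", "language_expert_dense",
    ],
)
# model = get_peft_model(base_model, lora_config)  # base_model: the loaded CogVLM2 model
```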

BruceJust · May 28, 2024

The loss indeed does not decrease; this issue has been logged. We have tested that the finetuned model does produce output in line with the finetuning data (but the loss does not go down, and we are still debugging).

zRzRzRzRzRzRzR · May 28, 2024

My initial guess is that the lr is too low. Setting it to 1e-4 makes it somewhat better: there is a downward trend, but it is still not obvious, whereas for an LLM it is usually very obvious. (screenshot: WeCom screenshot of the loss curve)

BruceJust · May 28, 2024

> My initial guess is that the lr is too low. Setting it to 1e-4 makes it somewhat better: there is a downward trend, but it is still not obvious, whereas for an LLM it is usually very obvious.

With lr=1e-4 the loss drops by about 0.04 per epoch; with 2e-7 it drops by about 0.01 per epoch (which is so small that to the naked eye it looks like it is not decreasing).
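To put those rates in perspective, a quick back-of-the-envelope estimate (assuming the per-epoch decrease stays roughly constant, which real loss curves rarely do):

```python
# Rough arithmetic for the per-epoch decreases quoted above:
# how many epochs would it take for the loss to fall from ~2.0 to ~1.0 at each rate?
start_loss, target_loss = 2.0, 1.0
for lr, drop_per_epoch in [(1e-4, 0.04), (2e-7, 0.01)]:
    epochs = (start_loss - target_loss) / drop_per_epoch
    print(f"lr={lr:g}: roughly {epochs:.0f} epochs")
# lr=0.0001: roughly 25 epochs
# lr=2e-07: roughly 100 epochs
```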

BruceJust · May 28, 2024

You can try finetuning with sat: https://github.com/THUDM/CogVLM/tree/cogvlm2_dev. Just replace CogVLMModel with CogVLM2Model in the finetuning code. This will be merged into this repository later.
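A minimal sketch of what that swap could look like in the sat-based finetuning script; the import path and constructor signature here are assumptions, and the actual code in the cogvlm2_dev branch may differ:

```python
# Hypothetical illustration of the one-class swap described above; the module
# path and constructor arguments are assumptions, not the branch's actual code.
# Before: from models.cogvlm_model import CogVLMModel
from models.cogvlm_model import CogVLM2Model  # swapped-in model class

def get_model(args):
    # The rest of the finetuning script stays unchanged; only the class differs.
    return CogVLM2Model(args)
```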

1049451037 · May 28, 2024

> You can try finetuning with sat: https://github.com/THUDM/CogVLM/tree/cogvlm2_dev. Just replace CogVLMModel with CogVLM2Model in the finetuning code. This will be merged into this repository later.

Has this been merged yet? The loss still fluctuates quite a bit.

lk137095576 · Jun 14, 2024

> The loss indeed does not decrease; this issue has been logged. We have tested that the finetuned model does produce output in line with the finetuning data (but the loss does not go down, and we are still debugging).

Hi, could you tell us roughly what the cause is? The training loss still does not decrease much.

Helens1997 · Jul 3, 2024