Finetuning example loss does not decrease over time
System Info
CUDA 12.1, Torch 2.3
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Reproduction
I've been trying to run the finetuning example for CogVLM2 (on a dual A100 setup with DeepSpeed). The loss fluctuates wildly around 2.0 and doesn't seem to converge or decrease over time. At first I tried with my own dataset; when that didn't work, I also tested the code with the provided labels_en data, both the multi-conversation and single-conversation versions, and it behaves mostly the same (the en version fluctuates around 1.8, up and down).
I tried decreasing and increasing the lr, with no luck. Any suggestions?
BTW, thanks for releasing this great model!
Expected behavior
Loss decreases.
Just in case, I tried the same finetuning on 8 x A100 to mimic the docs; the loss doesn't seem to decrease either.
This is my TensorBoard for the 8 x GPU finetuning right now, all params default:
Me too, the loss does not decrease either. My target modules: ["gate_proj","up_proj","down_proj","vision_expert_query_key_value","vision_expert_dense","language_expert_query_key_value","language_expert_dense"]
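For context, that list looks like a PEFT LoRA `target_modules` setting. A minimal sketch of such a config, where only `target_modules` comes from the comment above and the rank/alpha/dropout values are illustrative assumptions:

```python
from peft import LoraConfig, get_peft_model

# r / lora_alpha / lora_dropout are assumed values for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "gate_proj", "up_proj", "down_proj",
        "vision_expert_query_key_value", "vision_expert_dense",
        "language_expert_query_key_value", "language_expert_dense",
    ],
)
# model = get_peft_model(base_model, lora_config)  # base_model: the loaded CogVLM2 model
```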
The loss indeed does not decrease; this issue has been noted. We verified that the finetuned model does produce outputs consistent with the finetuning data (but the loss does not go down; we are still debugging this).
Our initial guess is that the lr is too low. Changing it to 1e-4 helps: there is a downward trend, though it is still not obvious, whereas with a plain LLM the decrease is usually very obvious.
With 1e-4 the loss drops about 0.04 per epoch; with 2e-7 it drops about 0.01 per epoch (too small, so to the eye it looks like it is not decreasing at all).
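In case it helps, a generic sketch of what raising the LR means at the optimizer level. Where the LR is actually set depends on the finetuning script / DeepSpeed config, so the model below is only a hypothetical stand-in:

```python
import torch

# Hypothetical stand-in for the (LoRA-wrapped) CogVLM2 model in the finetune script.
model = torch.nn.Linear(16, 16)

# lr=1e-4 gives visible movement (~0.04 loss/epoch, per the comment above);
# with 2e-7 the ~0.01/epoch drop looks flat on TensorBoard.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
)
```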
You can try finetuning with SAT: https://github.com/THUDM/CogVLM/tree/cogvlm2_dev . Just replace CogVLMModel with CogVLM2Model in the finetuning code. This will be merged into this repository later.
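For anyone trying this, a hedged sketch of the suggested swap. The import path below is an assumption for illustration; match it to wherever the finetune script in the cogvlm2_dev branch actually imports CogVLMModel from:

```python
# Before (assumed location of the original import in the SAT finetune script):
# from models.cogvlm_model import CogVLMModel
# After: only the class changes; the rest of the script stays the same.
from models.cogvlm_model import CogVLM2Model

ModelClass = CogVLM2Model  # use this wherever the script previously constructed CogVLMModel
```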
Has this been merged yet? The loss still fluctuates a lot.
> The loss indeed does not decrease; this issue has been noted. We verified that the finetuned model does produce outputs consistent with the finetuning data (but the loss does not go down; we are still debugging this).

Hi, could you share what the cause turned out to be? The loss still doesn't really decrease during training.
