
Finetuning example loss does not decrease over time

Open · hvico opened this issue · 8 comments

System Info / 系統信息

CUDA 12.1, Torch 2.3

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • [ ] The official example scripts / 官方的示例脚本
  • [ ] My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

I've been trying to run the finetuning example for CogVLM2 (on a dual-A100 setup with DeepSpeed). The loss fluctuates wildly around 2.0 and doesn't seem to converge or decrease over time. At first I tried with my own dataset, but since that didn't work I also tested the code with the provided labels_en, both the multi-conversation and the single-conversation versions, and it behaves mostly the same (the en version fluctuates around maybe 1.8, up and down).

I tried decreasing and increasing the lr, with no luck. Any suggestions?
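For reference, in a DeepSpeed-based run the learning rate is typically set in the optimizer and scheduler blocks of the DeepSpeed config. A minimal sketch, expressed as a Python dict rather than the repo's actual ds_config (all values here are placeholders, not the example's defaults):

```python
# Sketch of the parts of a DeepSpeed config that control the learning rate.
# Field names follow the standard DeepSpeed schema; the values are placeholders,
# not the CogVLM2 finetuning example's defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4, "weight_decay": 0.05},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": 0, "warmup_max_lr": 1e-4, "warmup_num_steps": 100},
    },
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}
```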

BTW, thanks for releasing this great model!

Expected behavior / 期待表现

Loss decreases.

(screenshot: loss curve from the finetuning run)

hvico · May 23, 2024

Just in case, I also tried the same finetuning on 8 x A100 to mimic the docs; the loss doesn't seem to decrease there either.

This is my TensorBoard for the 8-GPU finetuning right now, with all params at their defaults:

(screenshot: TensorBoard loss curve for the 8-GPU run)

hvico · May 23, 2024

Same here, the loss does not decrease for me either. My target modules: ["gate_proj","up_proj","down_proj","vision_expert_query_key_value","vision_expert_dense","language_expert_query_key_value","language_expert_dense"]
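For context, those look like LoRA target modules. A minimal sketch of how such a list might be wired up with Hugging Face PEFT, assuming a PEFT-based LoRA setup; the rank/alpha/dropout values are placeholders, not the repo's settings:

```python
# Minimal LoRA setup sketch using Hugging Face PEFT. The target_modules list is
# taken from the comment above; r, lora_alpha and lora_dropout are placeholders.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "gate_proj", "up_proj", "down_proj",
        "vision_expert_query_key_value", "vision_expert_dense",
        "language_expert_query_key_value", "language_expert_dense",
    ],
)
# model = get_peft_model(base_model, lora_config)  # base_model: the loaded CogVLM2 model
```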

BruceJust · May 28, 2024

The loss indeed does not decrease; this issue has been logged. We have tested that the finetuned model does produce output in line with the finetuning data (but the loss does not go down, and we are still debugging).

zRzRzRzRzRzRzR · May 28, 2024

My initial guess is that the lr is too low. Setting it to 1e-4 makes it somewhat better: there is a downward trend, but it is still not obvious, whereas for an LLM it is usually very obvious. (screenshot: WeCom screenshot of the loss curve)

BruceJust · May 28, 2024

> My initial guess is that the lr is too low. Setting it to 1e-4 makes it somewhat better: there is a downward trend, but it is still not obvious, whereas for an LLM it is usually very obvious.

With lr=1e-4 the loss drops by about 0.04 per epoch; with 2e-7 it drops by about 0.01 per epoch (which is so small that to the naked eye it looks like it is not decreasing).
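To put those rates in perspective, a quick back-of-the-envelope estimate (assuming the per-epoch decrease stays roughly constant, which real loss curves rarely do):

```python
# Rough arithmetic for the per-epoch decreases quoted above:
# how many epochs would it take for the loss to fall from ~2.0 to ~1.0 at each rate?
start_loss, target_loss = 2.0, 1.0
for lr, drop_per_epoch in [(1e-4, 0.04), (2e-7, 0.01)]:
    epochs = (start_loss - target_loss) / drop_per_epoch
    print(f"lr={lr:g}: roughly {epochs:.0f} epochs")
# lr=0.0001: roughly 25 epochs
# lr=2e-07: roughly 100 epochs
```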

BruceJust · May 28, 2024

You can try finetuning with sat: https://github.com/THUDM/CogVLM/tree/cogvlm2_dev. Just replace CogVLMModel with CogVLM2Model in the finetuning code. This will be merged into this repository later.
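A minimal sketch of what that swap could look like in the sat-based finetuning script; the import path and constructor signature here are assumptions, and the actual code in the cogvlm2_dev branch may differ:

```python
# Hypothetical illustration of the one-class swap described above; the module
# path and constructor arguments are assumptions, not the branch's actual code.
# Before: from models.cogvlm_model import CogVLMModel
from models.cogvlm_model import CogVLM2Model  # swapped-in model class

def get_model(args):
    # The rest of the finetuning script stays unchanged; only the class differs.
    return CogVLM2Model(args)
```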

1049451037 · May 28, 2024

> You can try finetuning with sat: https://github.com/THUDM/CogVLM/tree/cogvlm2_dev. Just replace CogVLMModel with CogVLM2Model in the finetuning code. This will be merged into this repository later.

Has this been merged yet? The loss still fluctuates quite a bit.

lk137095576 · Jun 14, 2024

> The loss indeed does not decrease; this issue has been logged. We have tested that the finetuned model does produce output in line with the finetuning data (but the loss does not go down, and we are still debugging).

Hi, could you tell us roughly what the cause is? The training loss still does not decrease much.

Helens1997 · Jul 3, 2024