MiniCPM-V

[BUG] <title>

Open 66RomanReigns opened this issue 1 year ago • 1 comments

Is there an existing issue / discussion for this?

  • [X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • [X] I have searched FAQ

Current Behavior

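I am running full-parameter SFT fine-tuning of MiniCPM-V-2.6. On the earlier datasets, which contain only single images, training runs normally, with loss and gradients reported. Once training reaches the multi-image dataset it hangs, and the run eventually ends with a timeout. The custom dataset record it hangs on has this format:

{
  "id": "133",
  "image": {
    "<image_00>": "2154606e17276278140924117d102f-0.jpg",
    "<image_01>": "2154606e17276278140924117d102f-1.jpg"
  },
  "conversations": [
    {
      "role": "user",
      "content": "<User and customer service conversation START>\nUser: <image_00>\nAgent: Hi dear, I'm here. A picture alone isn't enough for me to answer your question; could you describe it in text, for example “shipping time”?\nUser: Hello, is this set currently out of stock?\nAgent: Hi dear, items on the page that can be added to the cart and paid for are usually in stock; items that can't be added are temporarily out of stock. We restock from time to time, so if you like it you can add it to your favorites first. You can also take a look at other similar styles~\nUser: <image_01>\nAgent: Hi dear, I'm here. A picture alone isn't enough for me to answer your question; could you describe it in text, for example “shipping time”?\nUser: Hello, the design of this outfit really appeals to me. If it gets restocked, please let me know, thanks!\nAgent: Hi dear, that's not certain yet. If you like it, I'd suggest adding it to your favorites first, and you can also see whether any other styles catch your eye~\n<User and customer service conversation END>\nPlease output only the classification label directly, with no other extra words. The classification labels you can refer to are: [\"Reports poor sealing\",\"Whether it works well\",\"Whether it rusts\",\"Drainage method\",\"Packaging differences\",\"Shipment quantity\",\"Reports symptoms after use\",\"Product material\",\"Efficacy and function\",\"Whether it fades easily\",\"Suitable season\",\"Whether it is dimmable\",\"Version/model differences\",\"Single-item recommendation\",\"Usage and dosage\",\"Control method\",\"Release date\",\"Product specifications\",\"Signal quality\",\"Care method\",\"Set recommendation\",\"When it will be restocked\",\"Bubbles\"]\n"
    },
    {
      "role": "assistant",
      "content": "When it will be restocked"
    }
  ]
},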
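One thing that may be worth ruling out before looking at the distributed side is a mismatch between the placeholders declared under "image" and the ones referenced in the user turns. The following is a minimal sketch of such a check for the record format shown above. The names `check_record` and `PLACEHOLDER` are hypothetical and are not part of the MiniCPM-V fine-tuning code, and passing this check does not by itself explain or fix the hang.

```python
import re

# Matches placeholders of the form <image_00>, <image_01>, ...
PLACEHOLDER = re.compile(r"<image_\d+>")


def check_record(record):
    """Return a list of placeholder problems found in one dataset record.

    Assumes the multi-image form shown in this issue, where "image" is a
    dict mapping placeholders to file names. Other forms are skipped.
    """
    images = record.get("image")
    if not isinstance(images, dict):
        return []  # not the multi-image dict form; nothing to cross-check

    declared = set(images)
    used = []
    for turn in record.get("conversations", []):
        if turn.get("role") == "user":
            used.extend(PLACEHOLDER.findall(turn.get("content", "")))

    problems = []
    for name in sorted(declared - set(used)):
        problems.append(f"declared but never referenced: {name}")
    for name in sorted(set(used) - declared):
        problems.append(f"referenced but not declared: {name}")
    for name in sorted({n for n in used if used.count(n) > 1}):
        problems.append(f"referenced more than once: {name}")
    return problems
```

Running `check_record` over every record of the multi-image dataset and printing the non-empty results would show whether any sample is malformed in this particular way.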

{'loss': 0.7426, 'grad_norm': 32.238158212312115, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.4433, 'grad_norm': 20.96614904500204, 'learning_rate': 8.859191006777896e-07, 'epoch': 0.06}
{'loss': 0.5118, 'grad_norm': 23.445825750005728, 'learning_rate': 1.4041485532469074e-06, 'epoch': 0.1}
{'loss': 0.672, 'grad_norm': 29.751098652216083, 'learning_rate': 1.7718382013555792e-06, 'epoch': 0.13}
0%|▍ | 4/1000 [01:20<5:21:31, 19.37s/it]
[rank0]:[E1209 23:20:30.131444904 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4851, OpType=_REDUCE_SCATTER_BASE, NumelIn=543570944, NumelOut=67946368, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
[rank0]:[E1209 23:20:30.131589559 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 4851, last enqueued NCCL work: 4853, last completed NCCL work: 4850.
[rank3]:[E1209 23:20:30.135955776 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4851, OpType=_REDUCE_SCATTER_BASE, NumelIn=543570944, NumelOut=67946368, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
[rank3]:[E1209 23:20:30.136100032 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 4851, last enqueued NCCL work: 4853, last completed NCCL work: 4850.
[rank2]:[E1209 23:20:30.143354456 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4851, OpType=_REDUCE_SCATTER_BASE, NumelIn=543570944, NumelOut=67946368, Timeout(ms)=600000) ran for 600031 milliseconds before timing out.
[rank2]:[E1209 23:20:30.143503353 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 4851, last enqueued NCCL work: 4853, last completed NCCL work: 4850.
[rank7]:[E1209 23:20:30.152424922 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4851, OpType=_REDUCE_SCATTER_BASE, NumelIn=543570944, NumelOut=67946368, Timeout(ms)=600000) ran for 600034 milliseconds before timing out.
[rank7]:[E1209 23:20:30.152559211 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 4851, last enqueued NCCL work: 4853, last completed NCCL work: 4850.
[rank6]:[E1209 23:20:30.187274433 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4851, OpType=_REDUCE_SCATTER_BASE, NumelIn=543570944, NumelOut=67946368, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank6]:[E1209 23:20:30.187397135 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 4851, last enqueued NCCL work: 4853, last completed NCCL work: 4850.
[rank5]:[E1209 23:20:30.194850107 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4851, OpType=_REDUCE_SCATTER_BASE, NumelIn=543570944, NumelOut=67946368, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
[rank5]:[E1209 23:20:30.194980756 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 4851, last enqueued NCCL work: 4853, last completed NCCL work: 4850.
.................
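To make a long watchdog dump like the one above easier to read, the timeout lines can be grouped by collective. The sketch below is only a log-reading aid, assuming the message format shown in this log; `WATCHDOG` and `summarize_timeouts` are hypothetical names, and the two sample lines are copied from the output above.

```python
import re
from collections import defaultdict

# Pattern for the "Watchdog caught collective operation timeout" lines
# as they appear in this issue's log (format assumed from that excerpt).
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def summarize_timeouts(log_text):
    """Map (SeqNum, OpType, NumelIn, NumelOut) to the sorted ranks reporting it."""
    summary = defaultdict(set)
    for m in WATCHDOG.finditer(log_text):
        key = (int(m["seq"]), m["op"], int(m["numel_in"]), int(m["numel_out"]))
        summary[key].add(int(m["rank"]))
    return {key: sorted(ranks) for key, ranks in summary.items()}


sample = (
    "[rank0]:[E1209 23:20:30.131444904 ProcessGroupNCCL.cpp:616] [Rank 0] "
    "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4851, "
    "OpType=_REDUCE_SCATTER_BASE, NumelIn=543570944, NumelOut=67946368, "
    "Timeout(ms)=600000) ran for 600012 milliseconds before timing out.\n"
    "[rank3]:[E1209 23:20:30.135955776 ProcessGroupNCCL.cpp:616] [Rank 3] "
    "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4851, "
    "OpType=_REDUCE_SCATTER_BASE, NumelIn=543570944, NumelOut=67946368, "
    "Timeout(ms)=600000) ran for 600020 milliseconds before timing out.\n"
)

for key, ranks in summarize_timeouts(sample).items():
    print(key, "ranks:", ranks)
```

In the excerpt, ranks 0, 2, 3, 5, 6 and 7 all report the same collective (SeqNum 4851, `_REDUCE_SCATTER_BASE`) after the 600000 ms timeout. The log is truncated, so the absence of ranks 1 and 4 from the excerpt should not be over-read. NumelIn is exactly 8 times NumelOut (543570944 = 8 × 67946368), which is consistent with a reduce-scatter across 8 ranks.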

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS: Ubuntu 22.04
- Python: 3.10
- Transformers: 4.45.1
- PyTorch: 2.5.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.4

Anything else?

No response

66RomanReigns, Dec 09 '24 17:12

Has this been resolved? I'm running into a similar problem.

Zhangyh056, Dec 27 '24 03:12

Try using our new model~

Cuiunbo, Feb 06 '25 18:02