mmdetection: Training hangs at the "advance dataloader" step when resuming from a checkpoint


Open Chengnotwang opened this issue 11 months ago • 13 comments

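Hello. My server restarted unexpectedly while I was training with MMDetection, so I tried to continue training with resume and ran into this situation. How can I fix it?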

Mar 16 '24 06:03 Chengnotwang

Excuse me, did you solve this problem?

Apr 09 '24 03:04 XLBL2333


Not yet

Apr 10 '24 12:04 Chengnotwang

@Chengnotwang Bro, I ran into this problem today too. It is probably a bug in their library: after the server restarted, resuming from that .pth hangs at this step.

May 16 '24 13:05 nyjshinibaba

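@chhluo @Chengnotwang From a close reading of the code, the cause appears to be how the checkpoint was saved: resume seems to work only with a .pth saved per epoch, not with weights saved per iteration. Add by_epoch=True to the CheckpointHook init arguments. Training then still runs by iteration, but checkpoints are saved per epoch, and those should be resumable. An iter checkpoint, however, cannot record which batches of the current epoch had already been trained at the last iteration (or this is simply a bug that has not been fixed yet), so resuming from an iter.pth hangs because it cannot fully load or determine the remaining data needed for the current epoch.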
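A minimal sketch of the setting described above, assuming an MMEngine-style config where hooks are declared under default_hooks. The interval value is a placeholder, not something taken from this thread, and whether this avoids the hang is only supported by the reports here.

```python
# Config fragment (sketch): make CheckpointHook save on epoch boundaries.
default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        by_epoch=True,  # save per epoch rather than per iteration
        interval=1,     # assumed placeholder: save every epoch
    ),
)
```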

May 17 '24 12:05 nyjshinibaba

@nyjshinibaba Thanks for sharing. Indeed, in my later runs, checkpoints saved with by_epoch training could be resumed, which should support your point.

May 17 '24 12:05 Chengnotwang

@Chengnotwang Thanks.

May 17 '24 12:05 nyjshinibaba

I debugged this for several days before finding out that it is caused by the newer version of mmengine. Try this:

mim install mmengine==0.10.2
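To confirm which mmengine version is actually active in the environment before and after the downgrade, something like the following can help. It is a small sketch using only the Python standard library; the helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if mmengine is installed, otherwise None.
print("mmengine:", installed_version("mmengine"))
```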

May 22 '24 02:05 ShenZheng2000

Thanks, I will try it when I have time.


May 22 '24 02:05 Chengnotwang

@ShenZheng2000 When you were debugging, was the model saved by iter rather than by epoch?

May 23 '24 08:05 nyjshinibaba

@nyjshinibaba Yes. It was saved by iter.

May 23 '24 15:05 ShenZheng2000

@ShenZheng2000 After switching to this version of mmengine, I get the error: loaded state dict has a different number of parameter groups
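For context, this message is raised by PyTorch's Optimizer.load_state_dict when the number of param_groups in the loaded state differs from the current optimizer's. A rough sketch for comparing the two counts, assuming both state dicts follow the usual PyTorch layout with a param_groups list; the helper names and the toy dicts are hypothetical stand-ins, not data from this issue.

```python
def count_param_groups(optimizer_state: dict) -> int:
    """Return the number of parameter groups in an optimizer state dict."""
    return len(optimizer_state.get("param_groups", []))


def param_groups_match(saved_state: dict, current_state: dict) -> bool:
    """True if both optimizer state dicts have the same number of groups."""
    return count_param_groups(saved_state) == count_param_groups(current_state)


# Toy dicts standing in for real optimizer state dicts (illustration only).
saved = {"state": {}, "param_groups": [{"lr": 0.01}, {"lr": 0.001}]}
current = {"state": {}, "param_groups": [{"lr": 0.01}]}
print(param_groups_match(saved, current))  # False
```

A mismatch here would suggest the optimizer built from the current config is grouped differently from the one that wrote the checkpoint.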

May 30 '24 03:05 nyjshinibaba

@ShenZheng2000 After changing the mmengine version, did you retrain from scratch, or did you resume directly from the previously saved iter checkpoint?

May 30 '24 03:05 nyjshinibaba

@nyjshinibaba I resumed.

May 30 '24 05:05 ShenZheng2000
