Error when resuming from a checkpoint
When I try to resume training from epoch 17 using:
bash tools/dist_train.sh config_file
it resumes from epoch 7 instead. The log output is as follows:
2022-08-07 01:03:00,220 - mmdet - INFO - load checkpoint from ./work_dirs/res50_cam_radar/epoch_17.pth
INFO:mmdet:load checkpoint from ./work_dirs/res50_cam_radar/epoch_17.pth
2022-08-07 01:03:00,220 - mmdet - INFO - Use load_from_local loader
INFO:mmdet:Use load_from_local loader
2022-08-07 01:03:00,656 - mmdet - INFO - resumed epoch 7, iter 49231
INFO:mmdet:resumed epoch 7, iter 49231
2022-08-07 01:03:00,658 - mmdet - INFO - Start running, host: root@container-49581189ae-608a3c1f, work_dir: /root/autodl-tmp/SparseFusion3D/work_dirs/res50_cam_radar
INFO:mmdet:Start running, host: root@container-49581189ae-608a3c1f, work_dir: /root/autodl-tmp/SparseFusion3D/work_dirs/res50_cam_radar
2022-08-07 01:03:00,658 - mmdet - INFO - Hooks will be executed in the following order:
The relevant argument settings in train.py:
parser = argparse.ArgumentParser(description='Train a detector')
parser.add_argument('config', help='train config file path')
parser.add_argument('--work-dir', help='the dir to save logs and models')
parser.add_argument(
    '--resume-from',
    default='./work_dirs/res50_cam_radar/epoch_17.pth',
    help='the checkpoint file to resume from')
parser.add_argument(
    '--no-validate',
    action='store_true',
    help='whether not to evaluate the checkpoint during training')
group_gpus = parser.add_mutually_exclusive_group()
How can I fix this so that epoch_17 is actually resumed?
Please provide the information generated by python mmdet/utils/collect_env.py.
In the latest version, mmdet v2.25.1, auto-resume is supported; see https://github.com/open-mmlab/mmdetection/blob/3b72b12fe9b14de906d1363982b9fba05e7d47c1/tools/train.py#L32.
@chhluo Environment:
TorchVision: 0.9.0
OpenCV: 4.5.5
MMCV: 1.3.8
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.17.0+34a4767
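I then tried printing the epoch stored in the checkpoint's meta:
import torch
checkpoint = torch.load('epoch_17.pth', map_location='cpu')  # load on CPU; no GPU needed to inspect meta
print(checkpoint['meta']['epoch'])
It also printed:
7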
Thanks for your reply
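For anyone checking several checkpoints at once, the manual inspection above can be wrapped in a small helper. This is only a sketch: the function names (epoch_from_filename, meta_matches_filename, load_meta) are hypothetical and not part of mmcv or mmdet, and it assumes checkpoints are named epoch_N.pth and carry a 'meta' dict with an 'epoch' key, as in the output above.

```python
import re


def epoch_from_filename(path):
    """Return the epoch number encoded in a name like 'epoch_17.pth', or None."""
    match = re.search(r"epoch_(\d+)\.pth$", path)
    return int(match.group(1)) if match else None


def meta_matches_filename(path, meta):
    """True if the epoch stored in the checkpoint meta equals the one in the file name."""
    expected = epoch_from_filename(path)
    return expected is not None and meta.get("epoch") == expected


def load_meta(path):
    """Load a checkpoint and return its 'meta' dict (requires PyTorch)."""
    import torch  # imported lazily so the helpers above work without PyTorch

    checkpoint = torch.load(path, map_location="cpu")
    return checkpoint.get("meta", {})
```

With the values reported in this thread, meta_matches_filename('epoch_17.pth', {'epoch': 7, 'iter': 49231}) returns False, which flags the checkpoint as having been saved with mismatched meta information.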
@nacayu That's strange. Please try to load other checkpoints.
@chhluo I get the same problem when I try to load epoch_18 and later checkpoints. I am not sure whether this is the same issue as https://github.com/open-mmlab/mmcv/pull/1105
See https://github.com/open-mmlab/mmcv/pull/1108 and https://github.com/open-mmlab/mmdetection/issues/5505. Please update mmcv to v1.3.18.
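To confirm whether an installed mmcv already meets the suggested version, the version strings have to be compared numerically, since a plain string comparison would rank '1.3.8' above '1.3.18'. A minimal sketch, with hypothetical helper names (version_tuple, is_at_least) that are not part of mmcv:

```python
import re


def version_tuple(version):
    """Turn '1.3.8' or '0.17.0+34a4767' into a tuple of ints, ignoring any suffix."""
    numeric = re.match(r"\d+(?:\.\d+)*", version)
    if numeric is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in numeric.group(0).split("."))


def is_at_least(installed, required):
    """True if `installed` is numerically greater than or equal to `required`."""
    a, b = version_tuple(installed), version_tuple(required)
    # Pad the shorter tuple with zeros so that '1.3' compares equal to '1.3.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

In an environment with mmcv installed, this could be called as is_at_least(mmcv.__version__, '1.3.18'); for the MMCV 1.3.8 listed in the environment above it returns False.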