
Resume from model Error

Open nacayu opened this issue 3 years ago • 5 comments

When I try to resume training from epoch 17, I use:

bash tools/dist_train.sh config_file

But it resumes from epoch 7 instead. The log output is as follows:

2022-08-07 01:03:00,220 - mmdet - INFO - load checkpoint from ./work_dirs/res50_cam_radar/epoch_17.pth
INFO:mmdet:load checkpoint from ./work_dirs/res50_cam_radar/epoch_17.pth
2022-08-07 01:03:00,220 - mmdet - INFO - Use load_from_local loader
INFO:mmdet:Use load_from_local loader
2022-08-07 01:03:00,656 - mmdet - INFO - resumed epoch 7, iter 49231
INFO:mmdet:resumed epoch 7, iter 49231
2022-08-07 01:03:00,658 - mmdet - INFO - Start running, host: root@container-49581189ae-608a3c1f, work_dir: /root/autodl-tmp/SparseFusion3D/work_dirs/res50_cam_radar
INFO:mmdet:Start running, host: root@container-49581189ae-608a3c1f, work_dir: /root/autodl-tmp/SparseFusion3D/work_dirs/res50_cam_radar
2022-08-07 01:03:00,658 - mmdet - INFO - Hooks will be executed in the following order:

The relevant settings in train.py:

    parser = argparse.ArgumentParser(description='Train a detector')
    parser.add_argument('config', help='train config file path')
    parser.add_argument('--work-dir', help='the dir to save logs and models')
    parser.add_argument(
        '--resume-from', default='./work_dirs/res50_cam_radar/epoch_17.pth', help='the checkpoint file to resume from')
    parser.add_argument(
        '--no-validate',
        action='store_true',
        help='whether not to evaluate the checkpoint during training')
    group_gpus = parser.add_mutually_exclusive_group()

How can I fix this so that training resumes from epoch_17?

nacayu avatar Aug 06 '22 17:08 nacayu

Please provide the information generated by python mmdet/utils/collect_env.py.

In the latest version, mmdet v2.25.1, auto-resume is supported; see https://github.com/open-mmlab/mmdetection/blob/3b72b12fe9b14de906d1363982b9fba05e7d47c1/tools/train.py#L32.

chhluo avatar Aug 07 '22 04:08 chhluo

@chhluo Envs:

TorchVision: 0.9.0
OpenCV: 4.5.5
MMCV: 1.3.8
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.17.0+34a4767

And then I tried to print meta:

model = torch.load('epoch_17.pth')
print(model['meta']['epoch'])

It also printed:

7
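
The same check can be repeated across several checkpoints with a small helper. This is only a sketch: `epoch_from_filename` and `epoch_mismatch` are hypothetical names, and it assumes the `meta` dict has already been read from the checkpoint, for example with `torch.load(path, map_location='cpu')['meta']`.

```python
import re
from typing import Optional


def epoch_from_filename(path: str) -> Optional[int]:
    """Return N for a file named like 'epoch_N.pth', or None if the name differs."""
    match = re.search(r"epoch_(\d+)\.pth$", path)
    return int(match.group(1)) if match else None


def epoch_mismatch(path: str, meta: dict) -> bool:
    """True when the epoch stored in `meta` differs from the one in the file name."""
    expected = epoch_from_filename(path)
    stored = meta.get("epoch")
    return expected is not None and stored != expected


# Values reported in this thread: the file is epoch_17.pth, but meta says 7.
print(epoch_mismatch("./work_dirs/res50_cam_radar/epoch_17.pth", {"epoch": 7}))
```

With the values above this prints `True`, which points at the saved metadata rather than at the loading step.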

Thanks for your reply

nacayu avatar Aug 07 '22 05:08 nacayu

@nacayu That's strange. Please try to load other checkpoints.

chhluo avatar Aug 07 '22 16:08 chhluo

@chhluo I get the same problem when I load epoch_18 and other checkpoints. I am not sure whether this is the same issue as https://github.com/open-mmlab/mmcv/pull/1105

nacayu avatar Aug 08 '22 01:08 nacayu

See https://github.com/open-mmlab/mmcv/pull/1108 and https://github.com/open-mmlab/mmdetection/issues/5505. Please update mmcv to v1.3.18.
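
To confirm whether an environment is older than the suggested release, the version strings have to be compared numerically, since a plain string comparison ranks "1.3.8" above "1.3.18". The following is a minimal sketch; `version_tuple` and `needs_upgrade` are hypothetical helpers, and it assumes a plain `major.minor.patch` prefix. In practice the installed string would come from `mmcv.__version__`.

```python
import re


def version_tuple(version: str) -> tuple:
    """Parse the leading 'major.minor.patch' of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(installed: str, required: str = "1.3.18") -> bool:
    """True when `installed` is numerically older than `required`."""
    return version_tuple(installed) < version_tuple(required)


# The environment listed earlier in this thread reports MMCV 1.3.8.
print(needs_upgrade("1.3.8"))
```

For 1.3.8 this prints `True`, so that environment predates v1.3.18.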

chhluo avatar Aug 08 '22 02:08 chhluo