YOLOX

An error occurred while resuming training from previous training results

Open 66Kevin opened this issue 3 years ago • 5 comments

I have two questions to ask:

  • Is it possible to resume training from the results of a completed training run? For example, if I have trained for 100 epochs and have the last_epoch_ckpt.pth file, can I change max_epoch in the configuration file to 200 and resume training from last_epoch_ckpt.pth?
  • When I tried to do this, I encountered the following error:
2022-02-23 01:32:51.186 | INFO     | yolox.core.launch:_distributed_worker:116 - Rank 1 initialization finished.
2022-02-23 01:32:51.312 | INFO     | yolox.core.launch:_distributed_worker:116 - Rank 0 initialization finished.
2022-02-23 01:33:02.139 | INFO     | yolox.utils.setup_env:configure_omp:41 - 
We set `OMP_NUM_THREADS` for each process to 1 to speed up.
please further tune the variable for optimal performance.
2022-02-23 01:33:02 | INFO     | yolox.core.trainer:127 - args: Namespace(experiment_name='yolox_x', name=None, dist_backend='nccl', dist_url=None, batch_size=32, devices=2, exp_file='exps/yolox_x.py', resume=True, ckpt='YOLOX_outputs/yolox_x_100epoch/last_mosaic_epoch_ckpt.pth', start_epoch=None, num_machines=1, machine_rank=0, fp16=False, cache=False, occupy=True, opts=[])
2022-02-23 01:33:02 | INFO     | yolox.core.trainer:128 - exp value:
(exp value has been ignored here)
2022-02-23 01:33:06 | INFO     | yolox.core.trainer:133 - Model Summary: Params: 99.00M, Gflops: 281.50
2022-02-23 01:33:06 | INFO     | yolox.core.trainer:263 - resume training
2022-02-23 01:33:19 | INFO     | yolox.core.trainer:281 - loaded checkpoint 'True' (epoch 85)
2022-02-23 01:33:19 | INFO     | yolox.data.datasets.coco:63 - loading annotations into memory...
2022-02-23 01:33:19 | INFO     | yolox.data.datasets.coco:63 - Done (t=0.35s)
2022-02-23 01:33:19 | INFO     | pycocotools.coco:86 - creating index...
2022-02-23 01:33:19 | INFO     | pycocotools.coco:86 - index created!
2022-02-23 01:33:21 | INFO     | yolox.core.trainer:152 - init prefetcher, this might take one minute or less...
2022-02-23 01:33:43 | INFO     | yolox.data.datasets.coco:63 - loading annotations into memory...
2022-02-23 01:33:43 | INFO     | yolox.data.datasets.coco:63 - Done (t=0.01s)
2022-02-23 01:33:43 | INFO     | pycocotools.coco:86 - creating index...
2022-02-23 01:33:43 | INFO     | pycocotools.coco:86 - index created!
2022-02-23 01:33:43 | INFO     | yolox.core.trainer:180 - Training start...
(Model Summary has been ignored here.)
2022-02-23 01:33:43 | INFO     | yolox.core.trainer:184 - Training of experiment is done and the best AP is 44.20
/data/home/scv1442/.conda/envs/pytorch/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272068185/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/data/home/scv1442/.conda/envs/pytorch/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272068185/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  • The command I ran: python tools/train.py -f exps/yolox_x.py -d 2 -b 32 -o --resume -c YOLOX_outputs/yolox_x_100epoch/last_epoch_ckpt.pth

66Kevin, Feb 23 '22 01:02

  1. Yes, you can. Training will resume at epoch 100 and continue to epoch 200.
  2. That is just a warning caused by an API change in PyTorch; you can ignore it.
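
To illustrate the second point, here is a minimal sketch using only the standard library. The function `noisy_meshgrid_stub` is a hypothetical stand-in that emits the same message text as the log above; it is not PyTorch code. It shows that a `UserWarning` does not stop execution, and one way the message could be filtered if it is distracting.

```python
import warnings


def noisy_meshgrid_stub():
    # Hypothetical stand-in for a library call that emits the warning seen in the log.
    warnings.warn(
        "torch.meshgrid: in an upcoming release, it will be required "
        "to pass the indexing argument.",
        UserWarning,
    )
    return "still running"


# A warning does not interrupt the caller: the function still returns normally.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = noisy_meshgrid_stub()

# Optionally hide only this message; the pattern is a regex matched at the
# start of the warning text.
with warnings.catch_warnings(record=True) as caught_filtered:
    warnings.simplefilter("always")
    warnings.filterwarnings(
        "ignore", message=r"torch\.meshgrid", category=UserWarning
    )
    noisy_meshgrid_stub()

print(result, len(caught), len(caught_filtered))
```

Filtering only hides the message. It does not change what the underlying call does.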

FateScript, Feb 23 '22 03:02

Thanks for your reply. I understand that it is just a warning. However, training terminates right when this warning appears, so I cannot resume from the last completed run:

2022-02-23 01:33:43 | INFO | yolox.core.trainer:184 - Training of experiment is done and the best AP is 44.20

It looks as if the code around line 70 of yolox/core/trainer.py caught an exception that ended the training, but I don't know which exception it was. Do you have any idea?

def train(self):
    self.before_train()
    try:
        self.train_in_epoch()
    except Exception:
        raise
    finally:
        self.after_train()
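
For reference, a small standalone sketch of how this try/except/finally shape behaves in Python. The names `ok`, `failing`, and the `events` list are hypothetical and exist only for illustration. An `except Exception: raise` clause re-raises the exception unchanged, so nothing is swallowed, and the `finally` block runs whether or not an exception occurred. If the "Training of experiment is done" message is logged from `after_train` (an assumption here), it would appear in both cases.

```python
def train(train_in_epoch, events):
    # Same control flow as the quoted method, with the hooks replaced by
    # appends to a list so the order of execution can be inspected.
    events.append("before_train")
    try:
        train_in_epoch()
    except Exception:
        raise  # re-raised unchanged; the caller still sees the exception
    finally:
        events.append("after_train")  # runs on success and on failure


def ok():
    pass


def failing():
    raise RuntimeError("boom")


clean_events = []
train(ok, clean_events)

failed_events = []
try:
    train(failing, failed_events)
    propagated = False
except RuntimeError:
    propagated = True

print(clean_events, failed_events, propagated)
```

In both runs the recorded events are identical, so the "done" message alone does not show that an exception was raised. A real exception would still propagate and normally produce a traceback.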

66Kevin, Feb 23 '22 14:02

Any exception trace log @66Kevin ?

FateScript, Feb 24 '22 06:02

Sorry, I don't get any other exception trace log, so I am confused about why training was terminated.
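
One way a run can end with no traceback at all is an epoch loop that has nothing left to iterate over. The sketch below assumes the trainer loops over `range(start_epoch, max_epoch)`. That is an assumption about the trainer, not something confirmed in this thread, and `simulate_training` is a hypothetical helper. If the effective `max_epoch` is not greater than the epoch restored from the checkpoint, the loop body never runs and the "done" message follows "Training start..." immediately.

```python
def simulate_training(start_epoch, max_epoch):
    # Hypothetical model of an epoch loop; it only records log-like messages.
    log = ["Training start..."]
    for epoch in range(start_epoch, max_epoch):
        log.append(f"epoch {epoch + 1}")
    log.append("Training of experiment is done")
    return log


# Resuming at epoch 100 with max_epoch still 100: zero iterations, no error.
finished = simulate_training(start_epoch=100, max_epoch=100)

# Resuming at epoch 100 with max_epoch raised to 200: 100 more epochs run.
extended = simulate_training(start_epoch=100, max_epoch=200)

print(finished)
print(len(extended))
```

If this is what happened, the `exp value` section printed at startup should show the `max_epoch` that was actually in effect for the resumed run.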

66Kevin, Feb 24 '22 13:02

Hi, I'm planning to do something similar. Were you able to resolve this issue? @66Kevin @FateScript

sanbene, Sep 29 '23 13:09