YOLOX
An error occurred while resuming training from previous training results
I have two questions:
- Is it possible to resume training from a completed run? For example, if I have trained for 100 epochs and have the last_epoch_ckpt.pth file, can I change max_epoch in the configuration file to 200 and resume training from last_epoch_ckpt.pth?
- When I tried this, I encountered the following error:
2022-02-23 01:32:51.186 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 1 initialization finished.
2022-02-23 01:32:51.312 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 0 initialization finished.
2022-02-23 01:33:02.139 | INFO | yolox.utils.setup_env:configure_omp:41 -
We set `OMP_NUM_THREADS` for each process to 1 to speed up.
please further tune the variable for optimal performance.
2022-02-23 01:33:02 | INFO | yolox.core.trainer:127 - args: Namespace(experiment_name='yolox_x', name=None, dist_backend='nccl', dist_url=None, batch_size=32, devices=2, exp_file='exps/yolox_x.py', resume=True, ckpt='YOLOX_outputs/yolox_x_100epoch/last_mosaic_epoch_ckpt.pth', start_epoch=None, num_machines=1, machine_rank=0, fp16=False, cache=False, occupy=True, opts=[])
2022-02-23 01:33:02 | INFO | yolox.core.trainer:128 - exp value:
(exp value has been ignored here)
2022-02-23 01:33:06 | INFO | yolox.core.trainer:133 - Model Summary: Params: 99.00M, Gflops: 281.50
2022-02-23 01:33:06 | INFO | yolox.core.trainer:263 - resume training
2022-02-23 01:33:19 | INFO | yolox.core.trainer:281 - loaded checkpoint 'True' (epoch 85)
2022-02-23 01:33:19 | INFO | yolox.data.datasets.coco:63 - loading annotations into memory...
2022-02-23 01:33:19 | INFO | yolox.data.datasets.coco:63 - Done (t=0.35s)
2022-02-23 01:33:19 | INFO | pycocotools.coco:86 - creating index...
2022-02-23 01:33:19 | INFO | pycocotools.coco:86 - index created!
2022-02-23 01:33:21 | INFO | yolox.core.trainer:152 - init prefetcher, this might take one minute or less...
2022-02-23 01:33:43 | INFO | yolox.data.datasets.coco:63 - loading annotations into memory...
2022-02-23 01:33:43 | INFO | yolox.data.datasets.coco:63 - Done (t=0.01s)
2022-02-23 01:33:43 | INFO | pycocotools.coco:86 - creating index...
2022-02-23 01:33:43 | INFO | pycocotools.coco:86 - index created!
2022-02-23 01:33:43 | INFO | yolox.core.trainer:180 - Training start...
(Model Summary has been ignored here.)
2022-02-23 01:33:43 | INFO | yolox.core.trainer:184 - Training of experiment is done and the best AP is 44.20
/data/home/scv1442/.conda/envs/pytorch/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272068185/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/data/home/scv1442/.conda/envs/pytorch/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272068185/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
python tools/train.py -f exps/yolox_x.py -d 2 -b 32 -o --resume -c YOLOX_outputs/yolox_x_100epoch/last_epoch_ckpt.pth
- Yes, you can; training will run to epoch 200, starting from epoch 100.
- That is just a warning caused by an API change in PyTorch; don't worry about it.
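The resume behaviour described above can be sketched with the epoch loop that this style of trainer typically uses (a simplified illustration, not the actual yolox/core/trainer.py code; `start_epoch` would come from the checkpoint and `max_epoch` from the exp file). It also shows one possible explanation for the immediate "Training of experiment is done" message: if `max_epoch` is not raised above the resumed epoch, the loop body never executes.

```python
# Minimal sketch of epoch-based resume, assuming a YOLOX-style trainer:
# the trainer iterates over range(start_epoch, max_epoch), where
# start_epoch is restored from the checkpoint and max_epoch is read
# from the experiment configuration.

def epochs_to_run(start_epoch: int, max_epoch: int) -> list:
    """Epochs the trainer would actually execute after resuming."""
    return list(range(start_epoch, max_epoch))

# Resuming a 100-epoch checkpoint after raising max_epoch to 200
# yields 100 more epochs of training:
print(len(epochs_to_run(100, 200)))

# If max_epoch is left at 100, the range is empty, the loop body never
# runs, and training "finishes" immediately:
print(len(epochs_to_run(100, 100)))
```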
Thanks for your reply. I understand it is just a warning. However, training terminates when this warning appears, so it cannot be resumed from the last completed run.
2022-02-23 01:33:43 | INFO | yolox.core.trainer:184 - Training of experiment is done and the best AP is 44.20
The cause seems to be that something around line 70 of yolox/core/trainer.py caught an exception that terminated training, so I can't tell which exception it was. Do you have any idea about that?
def train(self):
    self.before_train()
    try:
        self.train_in_epoch()
    except Exception:
        raise
    finally:
        self.after_train()
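Since the `except Exception: raise` block above re-raises without recording anything, one way to surface the hidden exception is to log the full traceback before re-raising. This is a hypothetical debugging patch, not the upstream code; the `Trainer` stand-in below only mirrors the method names shown above.

```python
import traceback


class Trainer:
    """Tiny stand-in for yolox.core.trainer.Trainer (names are illustrative)."""

    def before_train(self):
        pass

    def train_in_epoch(self):
        # Simulate the kind of failure that gets swallowed silently.
        raise RuntimeError("example failure inside the epoch loop")

    def after_train(self):
        pass

    def train(self):
        self.before_train()
        try:
            self.train_in_epoch()
        except Exception:
            # Print the full traceback before re-raising, so the real
            # cause is visible even when the launcher hides stderr.
            print(traceback.format_exc())
            raise
        finally:
            self.after_train()
```

In the real trainer one could use `loguru`'s `logger.exception(...)` in place of `print` so the traceback lands in the same log file as the other messages.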
Any exception trace log, @66Kevin?
Sorry, I don't have any other exception trace log, so I'm confused about why it terminated.
Hi, I'm planning to do something similar. Were you able to resolve this issue? @66Kevin @FateScript