fasterrcnn-pytorch-training-pipeline How can I properly resume my training?

My initial training command was the following:

python train.py --data data_configs/custom_data.yaml --epochs 20 --model fasterrcnn_resnet50_fpn_v2 --name custom_training --batch 4 --imgsz 320

I want to train for 20 epochs with a batch size of 4 on 320x320 images.

I cancelled the training partway through the 5th epoch (4 if zero-indexed). Now I would like to resume from where it stopped, and to be able to stop and resume as many times as needed until the 20th epoch is finished. What is the proper command for resuming training?

This is the command I tried:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --resume

It worked but stopped after trying to save the best model for epoch 5 (4 if zero-indexed).

This is the error shown on the terminal:

OSError: [WinError 1314] A required privilege is not held by the client: 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\outputs\training\custom_training\best_model.pth' -> 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\wandb\offline-run-20231012_212412-eh5ul1xy\files\outputs\training\custom_training\best_model.pth'

I hope for your response regarding my inquiry. Thank you.

Oct 12 '23 13:10 JerickoDG

Hello. It seems that you do not have write access to the disk. Can you try running the code in a terminal as Administrator?
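As a side note, WinError 1314 is the error Windows typically reports when a process tries to create a symbolic link without the required privilege, and the 'source' -> 'destination' form of the message above is consistent with that. The following is only a minimal sketch for checking whether the current terminal is allowed to create symlinks; `can_create_symlink` is a hypothetical helper and is not part of the training pipeline.

```python
import os
import tempfile


def can_create_symlink() -> bool:
    """Return True if this process can create a file symlink in a temp dir."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        # Create a small file to point the link at.
        with open(target, "w") as f:
            f.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # On Windows, a missing symlink privilege surfaces here as an OSError.
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    print("symlinks allowed:", can_create_symlink())
```

If this prints False in a normal terminal and True in one started as Administrator, the privilege is the likely cause.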

Oct 12 '23 14:10 sovit-123

Hi. Thanks for replying to my question. I tried passing the --epochs parameter with a value of 20, resulting in this command:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --epochs 20 --resume

It got past the 6th epoch (5 if zero-indexed) and is currently at the 7th epoch (6 if zero-indexed).

I'll keep observing the training and will try your suggestion if the problem appears again. I'll post an update if anything changes. Thank you again.

Oct 12 '23 14:10 JerickoDG

Got it. I had not noticed that you did not pass the epochs argument in the previous command. Glad that it is solved.

Oct 12 '23 14:10 sovit-123
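
To illustrate why passing --epochs on resume matters, here is a minimal sketch of how a resumed run might decide which epochs are left. It assumes the checkpoint records how many epochs were completed and that the loop runs from that point up to the requested total; `remaining_epochs` and the fallback value of 5 are hypothetical and used only for illustration, so check train.py for the real behaviour and default.

```python
def remaining_epochs(completed_epochs: int, total_epochs: int) -> range:
    """Zero-indexed epochs a resumed run would still execute.

    completed_epochs: number of epochs fully finished before the interruption
                      (assumed to be stored in the checkpoint).
    total_epochs:     value requested for the whole training run.
    """
    return range(completed_epochs, total_epochs)


if __name__ == "__main__":
    # Interrupted during the 5th epoch, so 4 epochs were completed.
    completed = 4

    # Resuming with the intended total of 20 leaves epochs 4..19 to run.
    print(len(remaining_epochs(completed, 20)))

    # If the total falls back to a small value (5 here, purely illustrative),
    # only one more epoch runs before training ends.
    print(len(remaining_epochs(completed, 5)))
```

Under these assumptions, the resume command needs the same total that was intended for the original run, not the number of epochs still to go.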
