fasterrcnn-pytorch-training-pipeline

How can I properly resume my training?

Open JerickoDG opened this issue 1 year ago • 3 comments

My initial training command was the following:

python train.py --data data_configs/custom_data.yaml --epochs 20 --model fasterrcnn_resnet50_fpn_v2 --name custom_training --batch 4 --imgsz 320

I want to train for 20 epochs with a batch size of 4 on 320x320 images.

I cancelled the training during the 5th epoch (4 if zero-indexed). Now I would like to resume training from where I stopped, and to be able to do this as many times as needed until the 20th epoch finishes. Do you know the proper command for resuming training?

This is the command I tried:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --resume

It worked, but stopped when it tried to save the best model for epoch 5 (4 if zero-indexed).

This is the error shown on the terminal:

OSError: [WinError 1314] A required privilege is not held by the client: 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\outputs\training\custom_training\best_model.pth' -> 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\wandb\offline-run-20231012_212412-eh5ul1xy\files\outputs\training\custom_training\best_model.pth'

I hope for your response regarding my inquiry. Thank you.

JerickoDG, Oct 12 '23 13:10

Hello. It seems that you do not have write access to the disk. Can you try running the code in a terminal as Administrator?
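
One possible reading of the error above: the `'src' -> 'dst'` form of the message is how Python reports failures of two-path operations such as `os.symlink`, and on Windows error 1314 is typically raised when the process lacks the privilege to create symbolic links, which an elevated (Administrator) terminal has. Assuming that is the cause here, which the thread does not confirm, the following minimal sketch checks whether the current terminal can create symlinks. The function name `can_create_symlink` is hypothetical and not part of the repository.

```python
import os
import tempfile


def can_create_symlink() -> bool:
    """Return True if this process is allowed to create a symbolic link."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        # Create a small file to link to.
        with open(target, "w") as f:
            f.write("x")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError) as exc:
            # On Windows without the symlink privilege this is usually WinError 1314.
            print(f"symlink creation failed: {exc}")
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    print("can create symlinks:", can_create_symlink())
```

If this prints `False` in a normal terminal and `True` in one started as Administrator, missing symlink privilege is a likely explanation for the error.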

sovit-123, Oct 12 '23 14:10

Hi. Thanks for replying to my question. I tried adding the --epochs parameter and set it to 20, resulting in this command:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --epochs 20 --resume

It ran past the 6th epoch (5 if zero-indexed) and is currently at the 7th epoch (6 if zero-indexed).

I'll keep observing the training and will try your suggestion if the problem appears again. I'll post an update if it does. Thank you again.

JerickoDG, Oct 12 '23 14:10

Got it. I had not seen that you did not pass the epochs argument in the previous command. Glad that it was solved.
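
To illustrate why the epochs argument matters when resuming, here is a minimal sketch of the usual resume arithmetic. It assumes that the checkpoint stores the number of completed epochs and that `--epochs` means the total target rather than the number of additional epochs; the repository's actual logic may differ. The helpers `save_checkpoint` and `epochs_to_run` are hypothetical, and a JSON file stands in for `last_model.pth`.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path, completed_epochs, state):
    """Store the number of completed epochs next to the training state."""
    Path(path).write_text(json.dumps({"epoch": completed_epochs, "state": state}))


def epochs_to_run(path, total_epochs):
    """Return the zero-indexed epochs that still have to be trained."""
    ckpt = json.loads(Path(path).read_text())
    start_epoch = ckpt["epoch"]  # epochs already finished before the interruption
    return list(range(start_epoch, total_epochs))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_path = Path(tmp) / "last_model.json"
        # Training was interrupted during the 5th epoch, so 4 epochs are complete.
        save_checkpoint(ckpt_path, 4, {})
        remaining = epochs_to_run(ckpt_path, total_epochs=20)
        print(remaining[0], remaining[-1], len(remaining))
```

Under these assumptions, resuming with a total of 20 continues from zero-indexed epoch 4 through epoch 19, whereas resuming without the argument would stop at whatever the script's default total is.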

sovit-123, Oct 12 '23 14:10