fasterrcnn-pytorch-training-pipeline
How can I properly resume my training?
My initial training command was the following:
python train.py --data data_configs/custom_data.yaml --epochs 20 --model fasterrcnn_resnet50_fpn_v2 --name custom_training --batch 4 --imgsz 320
I want to train for 20 epochs with a batch size of 4 on 320x320 images.
I cancelled the training during the 5th epoch (4 if zero-indexed). Now I would like to resume from where I stopped, and to be able to do this repeatedly until the 20th epoch finishes. Do you know the proper command for resuming training?
This is the command I tried:
python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --resume
It worked but stopped after trying to save the best model for epoch 5 (4 if zero-indexed).
This is the error shown on the terminal:
OSError: [WinError 1314] A required privilege is not held by the client: 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\outputs\training\custom_training\best_model.pth' -> 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\wandb\offline-run-20231012_212412-eh5ul1xy\files\outputs\training\custom_training\best_model.pth'
I hope for your response regarding my inquiry. Thank you.
Hello. It seems that you do not have write access to the disk. Can you try running the code in a terminal as Administrator?
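For context, WinError 1314 is the Windows error for a missing privilege, and the `source -> destination` form of the message is what Python prints when a two-path call such as `os.symlink` fails. Creating symbolic links on Windows usually needs an elevated terminal or Developer Mode, which is why running as Administrator may help. Below is a minimal sketch of a fallback, assuming the failure comes from a symlink attempt (this is not confirmed from the pipeline's code). `link_or_copy` is a hypothetical helper, not part of this repository or of wandb.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Try to symlink dst -> src; fall back to a plain copy if the OS refuses.

    Hypothetical helper for illustration only.
    """
    # Make sure the destination folder exists.
    os.makedirs(os.path.dirname(os.path.abspath(dst)), exist_ok=True)
    # Remove a stale file or link left over from an earlier save.
    if os.path.lexists(dst):
        os.remove(dst)
    try:
        os.symlink(os.path.abspath(src), dst)
        return "symlink"
    except OSError:
        # On Windows without the symlink privilege this branch is taken.
        shutil.copy2(src, dst)
        return "copy"
```

Calling `link_or_copy("outputs/training/custom_training/best_model.pth", some_destination)` returns either `"symlink"` or `"copy"` depending on what the system allows; `some_destination` is a placeholder path.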
Hi. Thanks for replying to my question. I tried adding a value for the --epochs parameter and set it to 20, resulting in this command:
python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --epochs 20 --resume
It ran past the 6th epoch (5 if zero-indexed) and is currently at the 7th epoch (6 if zero-indexed).
I'll keep observing the training and will try your suggestion if the problem appears again. I'll post an update if it does. Thank you again.
Got it. I had not noticed that you did not pass the --epochs argument in the previous command. Glad that it was solved.
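One plausible reading of the above is that, when resuming, --epochs is treated as the total number of epochs rather than the number of additional ones, so leaving it at its default can end the run early. A minimal sketch of that logic, assuming the checkpoint records how many epochs were completed; `remaining_epochs` is a hypothetical name and not the pipeline's actual code:

```python
def remaining_epochs(completed_epochs, total_epochs):
    """Return the zero-indexed epochs still to run after a resume.

    Assumes total_epochs is the overall target (the --epochs value),
    not a count of extra epochs. Illustrative only.
    """
    return list(range(completed_epochs, total_epochs))
```

Under this assumption, a run interrupted during zero-indexed epoch 4 and resumed with a target of 20 would continue with epochs 4 through 19, while a target equal to the completed count would leave nothing to run.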