fasterrcnn-pytorch-training-pipeline

Training on a custom dataset - starting from COCO pre-trained weights?

Open utility-aagrawal opened this issue 2 years ago • 10 comments

Hi,

I want to train a fasterrcnn_resnet50_fpn_v2 model on a custom dataset, starting from COCO pre-trained weights. Is that the default behavior, or do I need to supply a weights file through the --weights argument? If so, where can I find that file?

Thank you for your help!

utility-aagrawal avatar Aug 14 '23 18:08 utility-aagrawal

Hello @utility-aagrawal. It will load the COCO pretrained weights by default. You only need to provide --weights if you want to continue from one of your own checkpoints.

sovit-123 avatar Aug 15 '23 00:08 sovit-123

Thanks for the quick response, @sovit-123! I have another question on the GPU memory usage. I get a CUDA out of memory error for a small batch size of 8. I have a Tesla T4 GPU with ~16GB memory. On the same machine, I am able to train a YOLOv8 (43M parameters) with a batch size of 16 but I can only do a batch size of 4 for this repository. I am using the same dataset and same input image size for both of these models. Do you know where this difference is coming from? Do you have any recommendations to speed up the training? My dataset has 15k images and with default image size 640, it's taking almost an hour for an epoch.

Batch size 8 - image

Batch size 4 - image

I really appreciate your help with this!

utility-aagrawal avatar Aug 15 '23 14:08 utility-aagrawal

Hi. Try using --imgsz 640 together with square resizing and AMP (Automatic Mixed Precision). With AMP you can double the batch size in most cases. Add these arguments to your existing command:

python train.py --imgsz 640 --square-training --amp --batch 8

Another reason for the longer training time can be the default fasterrcnn_resnet50_fpn_v2 model. V2 is a better model than V1 and works very well with small objects, but it has a heavier FPN network. If you are okay with slightly worse results in exchange for faster training, try using --model fasterrcnn_resnet50_fpn

Can you please let me know how long one epoch takes with YOLOv8? It will help me optimize the repository even more.

sovit-123 avatar Aug 15 '23 14:08 sovit-123
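As background on the --amp suggestion above: in plain PyTorch, mixed precision training is usually written with `torch.autocast` plus a gradient scaler. The sketch below shows that pattern under stated assumptions; it is not taken from this repository's train.py. `train_step` and `ToyDetector` are hypothetical names, and `ToyDetector` only imitates the loss-dictionary interface that torchvision detection models expose in training mode.

```python
import torch
from torch import nn


class ToyDetector(nn.Module):
    """Hypothetical stand-in that returns a dict of losses like a detection model."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, images, targets):
        pred = self.fc(images).squeeze(-1)
        return {"loss_box_reg": nn.functional.mse_loss(pred, targets)}


def train_step(model, images, targets, optimizer, scaler, device):
    """One optimisation step; mixed precision is enabled only on CUDA devices."""
    use_amp = device.type == "cuda"
    optimizer.zero_grad(set_to_none=True)
    # Inside autocast, eligible ops run in reduced precision, which lowers
    # activation memory and is what allows a larger batch size.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())
    # Loss scaling guards against float16 gradient underflow; with a disabled
    # scaler these three calls reduce to a normal backward and optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = ToyDetector().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")
    images = torch.randn(8, 4, device=device)
    targets = torch.randn(8, device=device)
    print(train_step(model, images, targets, optimizer, scaler, device))
```

The memory saving comes from the forward activations being stored in half precision; the weights themselves stay in float32.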

With YOLOv8-large, it took around 20 mins for one epoch with a batch size of 16. I want to compare my YOLO model with a Faster RCNN. I didn't use square training for my YOLO model, so I don't want to use it for Faster RCNN, but using --amp I was able to start the training with a batch size of 8. It's still in progress. I'll let you know how that goes. Thanks for your help!

utility-aagrawal avatar Aug 15 '23 15:08 utility-aagrawal

Sure. Thanks.

sovit-123 avatar Aug 15 '23 16:08 sovit-123

Hi, just wanted to update you on the training time: it still seems pretty slow. It takes ~52 mins to complete an epoch.

utility-aagrawal avatar Aug 17 '23 13:08 utility-aagrawal
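To put the two reported epoch times on a common scale, they can be converted to images per second and seconds per iteration. The sketch below does that arithmetic, assuming the dataset is exactly 15,000 images (the thread only says "15k"); `epoch_stats` is a hypothetical helper name.

```python
import math


def epoch_stats(num_images, minutes_per_epoch, batch_size):
    """Convert an epoch time into throughput figures."""
    seconds = minutes_per_epoch * 60
    iterations = math.ceil(num_images / batch_size)
    return {
        "images_per_sec": num_images / seconds,
        "iterations": iterations,
        "sec_per_iter": seconds / iterations,
    }


# Figures reported in the thread; 15_000 is an assumed exact dataset size.
frcnn = epoch_stats(15_000, minutes_per_epoch=52, batch_size=8)
yolo = epoch_stats(15_000, minutes_per_epoch=20, batch_size=16)

print(f"Faster RCNN v2: {frcnn['images_per_sec']:.2f} img/s")
print(f"YOLOv8-large:   {yolo['images_per_sec']:.2f} img/s")
print(f"Speed ratio:    {yolo['images_per_sec'] / frcnn['images_per_sec']:.1f}x")
```

Under that assumption the per-image gap is the ratio of the epoch times, 52 / 20 = 2.6x, regardless of the batch sizes used.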

Hmm... That can be because of the fasterrcnn_resnet50_fpn_v2 model. Did you try the fasterrcnn_resnet50_fpn model?

sovit-123 avatar Aug 17 '23 13:08 sovit-123

I haven't tried the fasterrcnn_resnet50_fpn model yet because I wanted to compare the best Faster RCNN model with my YOLOv8 model. Unfortunately, training is too slow. It took a week to train the v2 model on ~15k images with --amp, --batch 8 and --imgsz 640 (without --square-training) on a 16G Tesla GPU. I was able to train a YOLOv8 (43M parameters) on the same machine using the same dataset and image size, but with a batch size of 16, in less than 48 hours. Let me know if you find a way to reduce the training time. For now, I'll be using the YOLO implementation. As for the performance, there are far more false positives than with my YOLO model. Thanks for your help!

utility-aagrawal avatar Aug 23 '23 15:08 utility-aagrawal

Can it be that YOLO speeds it up thanks to the dataloader? They probably pre-load images and annotations ... are you monitoring I/O operations in the two training settings (YOLOv8 vs Faster R-CNN)?

emanuelevivoli avatar May 04 '24 16:05 emanuelevivoli
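One framework-agnostic way to check whether data loading is the bottleneck is to time how long the loop waits for each batch separately from how long the training step takes. The sketch below is a minimal illustration of that idea; `time_data_vs_compute` is a hypothetical helper, and `batches` and `step_fn` stand in for any data loader and training step.

```python
import time


def time_data_vs_compute(batches, step_fn):
    """Return (seconds spent waiting for data, seconds spent in step_fn)."""
    data_time = 0.0
    compute_time = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # blocks while the loader prepares a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, step_fn should synchronise the device before returning
        # (e.g. torch.cuda.synchronize()), because CUDA kernels run
        # asynchronously and would otherwise be timed incorrectly.
        step_fn(batch)
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    return data_time, compute_time
```

If the data time is a small fraction of the compute time, faster loading or caching will not shorten the epoch much, and the model itself is the limiting factor.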

Faster RCNN is certainly slow to train compared to YOLO. However, it is not because of the data loader. Instead, it's because of the two-stage nature of Faster RCNN.

sovit-123 avatar May 04 '24 16:05 sovit-123