pytorch-faster-rcnn

Training loss has been oscillating and does not converge.


My training loss has been oscillating and does not converge. All parameters are at their defaults. My training command is: ./experiments/scripts/train_faster_rcnn_notime.sh 1 pascal_voc vgg16

yuanyao366 · Mar 15 '18 08:03

This is my training print-out:

iter: 16320 / 70000, total loss: 0.484521
rpn_loss_cls: 0.063619 rpn_loss_box: 0.022290 loss_cls: 0.232639 loss_box: 0.165972 lr: 0.001000 speed: 0.878s / iter
iter: 16340 / 70000, total loss: 0.325996
rpn_loss_cls: 0.174814 rpn_loss_box: 0.084035 loss_cls: 0.022101 loss_box: 0.045046 lr: 0.001000 speed: 0.878s / iter
iter: 16360 / 70000, total loss: 0.839716
rpn_loss_cls: 0.232109 rpn_loss_box: 0.024567 loss_cls: 0.340812 loss_box: 0.242227 lr: 0.001000 speed: 0.877s / iter
iter: 16380 / 70000, total loss: 1.083268
rpn_loss_cls: 0.136527 rpn_loss_box: 0.011490 loss_cls: 0.676050 loss_box: 0.259201 lr: 0.001000 speed: 0.877s / iter
iter: 16400 / 70000, total loss: 0.413278
rpn_loss_cls: 0.059325 rpn_loss_box: 0.088252 loss_cls: 0.086681 loss_box: 0.179020 lr: 0.001000 speed: 0.877s / iter
iter: 16420 / 70000, total loss: 0.380816
rpn_loss_cls: 0.041944 rpn_loss_box: 0.044617 loss_cls: 0.143582 loss_box: 0.150674 lr: 0.001000 speed: 0.877s / iter
iter: 16440 / 70000, total loss: 0.223295
rpn_loss_cls: 0.105737 rpn_loss_box: 0.008052 loss_cls: 0.077547 loss_box: 0.031959 lr: 0.001000 speed: 0.877s / iter
iter: 16460 / 70000, total loss: 0.196370
rpn_loss_cls: 0.013646 rpn_loss_box: 0.031613 loss_cls: 0.074249 loss_box: 0.076861 lr: 0.001000 speed: 0.877s / iter
iter: 16480 / 70000, total loss: 0.787172
rpn_loss_cls: 0.056828 rpn_loss_box: 0.038711 loss_cls: 0.394041 loss_box: 0.297591 lr: 0.001000

Is there any advice for this? Thanks a lot!

yuanyao366 · Mar 15 '18 09:03
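
With a per-step batch this small, single-iteration totals are noisy, so the trend is easier to judge after smoothing than from any one line of the print-out above. A minimal sketch, assuming the training output has been redirected to a log file (the path and helper name are hypothetical), that extracts the total-loss values and applies a moving average:

import re

# A minimal sketch (log path and helper name are hypothetical): pull the
# "total loss" values out of a saved training log and smooth them with a
# simple moving average, since the raw per-iteration values jump around.
def smoothed_total_loss(log_path, window=50):
    pattern = re.compile(r"total loss:\s*([0-9.]+)")
    losses = []
    with open(log_path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                losses.append(float(match.group(1)))
    smoothed = []
    for i in range(len(losses)):
        chunk = losses[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return losses, smoothed

# Usage (hypothetical log path):
# raw, avg = smoothed_total_loss("experiments/logs/vgg16_voc_train.log")
# print(avg[-5:])  # the smoothed tail should trend downward if training is progressing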

Did you finish training it? The loss does oscillate because the batch size is small.

ruotianluo · Apr 04 '18 21:04
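
Since the oscillation comes from the small per-step batch, one generic way to get a smoother gradient signal is to accumulate gradients over several iterations before each optimizer step. This is only a sketch in plain PyTorch, assuming a hypothetical model, optimizer, and data loader rather than this repo's actual training loop:

# Generic gradient-accumulation sketch in plain PyTorch (hypothetical `model`,
# `optimizer`, and `data_loader`; this is not the repo's own trainer).
# Accumulating gradients over several images before each optimizer step
# averages out some of the per-image noise behind the oscillating loss.
def train_with_accumulation(model, optimizer, data_loader, accum_steps=4):
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(data_loader):
        loss_dict = model(images, targets)            # assume the model returns a dict of losses
        loss = sum(loss_dict.values()) / accum_steps  # scale so the accumulated gradient is an average
        loss.backward()                               # gradients add up across iterations
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()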