PyTorch_YOLOv3

Confused about the train loss, size_average, and the performance

chengcchn opened this issue 4 years ago • 6 comments

Hi, @hirotomusiker. I come here again. As the title says, I am confused about the train loss, size_average, and the performance. I have trained both the original darknet repo and this repo on my own dataset (3 classes), and I want to share the results here. The params are the same: MAXITER: 6000, STEPS: (4800, 5400), IMGSIZE: 608 (both for train and test). With darknet, I got an mAP@0.5 of 79.0, and the final loss was 0.76 (avg). With this repo, the mAP@0.5 was 76.9, and the final loss was 4.7 (total). It seems that with this repo, the loss is harder to converge. So I changed the params for this repo (MAXITER: 8000, STEPS: (6400, 7200)) and got an mAP@0.5 of 78.3, with a final loss of 8.2 (total). So I have some questions:

  1. The performance seems different; could this be caused by the shuffling of the dataset?
  2. The loss of this repo is larger and harder to converge compared to darknet's. What's the reason?
  3. In #44, you talked about the param size_average and said that the loss of darknet is also high. How does that relate to the numbers above?

chengcchn · Mar 04 '20 11:03

  1. I cannot reproduce your training, but AP can change randomly if your dataset is not large enough and the training has not converged. I recommend plotting the val AP and making sure it has reached a plateau.
  2. The variation of loss values between iterations is large because the number of GT objects affects the loss.
  3. The logged loss of darknet (0.76 in your case) is a batch-summed loss. If the batch size is 64, darknet's logged loss is 64x higher than ours. The loss value is only for logging and does not affect the training performance (see the sketch below).
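
For intuition on point 3, here is a minimal, self-contained sketch (not code from either repo; the names are illustrative) showing that a batch-summed log value is exactly batch_size times larger than a batch-averaged one for the same per-image errors:

import torch

batch_size = 64
per_image_loss = torch.rand(batch_size)  # stand-in for any per-image loss values

batch_summed = per_image_loss.sum()      # darknet-style logging
batch_averaged = per_image_loss.mean()   # per-image-averaged logging

print(batch_summed / batch_averaged)     # tensor(64.) -- exactly batch_size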

hirotomusiker · Mar 04 '20 13:03

Hi, @hirotomusiker. Sorry for the late reply. I did as you said and got a good result. However, I found there is no setting for reproducibility, so I added the following seed setup before starting the training loop.

import random
import numpy as np
import torch

def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True

But I failed to get the same result between two runs. Any suggestions?
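
One thing setup_seed above does not cover is the data pipeline: if the training script uses a torch.utils.data.DataLoader with num_workers > 0, each worker process has its own numpy/random state. A hedged sketch of the standard PyTorch recipe (the seed_worker name and the commented loader line are illustrative, not from this repo):

import random
import numpy as np
import torch

def seed_worker(worker_id):
    # derive per-worker seeds from the torch base seed so augmentations are reproducible
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# also disable cuDNN autotuning, which can pick different kernels per run
torch.backends.cudnn.benchmark = False

# loader = torch.utils.data.DataLoader(dataset, batch_size=4,
#                                      num_workers=2, worker_init_fn=seed_worker)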

chengcchn · Mar 07 '20 07:03

Thank you, I've tried your seed setting and got the same loss results.

hirotomusiker · Mar 08 '20 03:03

Yes, in the first several epochs (like 100~200), the losses seem the same, but there are still slight differences if you look at the decimal places. And as the number of iterations increases, the loss difference becomes larger and larger, eventually leading to a difference in the mAP. I think this is due to randomness in the underlying PyTorch implementation, such as the CUDA implementation of the upsampling layer. Any suggestions?
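
Regarding the suspected CUDA nondeterminism: newer PyTorch versions (1.8+, not available when this thread was written) can surface exactly this. A hedged sketch; with it enabled, PyTorch raises a RuntimeError whenever an op without a deterministic CUDA implementation runs (some interpolation/upsampling backward kernels are known examples), instead of silently diverging:

import os
import torch

# must be set before any CUDA op; required by cuBLAS for determinism on CUDA >= 10.2
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# error out on any op that has no deterministic implementation
torch.use_deterministic_algorithms(True)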

chengcchn · Mar 16 '20 10:03

I have tried again and checked 40 iterations on COCO: 1st:

[Iter 0/500000] [lr 0.000000] [Losses: xy 43.622276, wh 16.042191, conf 67708.421875, cls 892.703674, total 25170.322266, imgsize 608]
[Iter 10/500000] [lr 0.000000] [Losses: xy 63.709991, wh 25.143564, conf 18768.097656, cls 1275.747925, total 7396.792969, imgsize 320]
[Iter 20/500000] [lr 0.000000] [Losses: xy 116.392715, wh 48.034309, conf 31668.382812, cls 2430.618652, total 12567.701172, imgsize 416]

2nd:

[Iter 0/500000] [lr 0.000000] [Losses: xy 43.622276, wh 16.042191, conf 67708.421875, cls 892.703674, total 25170.322266, imgsize 608]
[Iter 10/500000] [lr 0.000000] [Losses: xy 63.709991, wh 25.143564, conf 18768.097656, cls 1275.747925, total 7396.792969, imgsize 320]
[Iter 20/500000] [lr 0.000000] [Losses: xy 116.392715, wh 48.034309, conf 31668.382812, cls 2430.618652, total 12567.701172, imgsize 416]

The results are exactly the same.
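
If you want to check agreement beyond eyeballing, here is a minimal sketch for diffing two training logs line by line (the file names are hypothetical):

with open("run1.log") as a, open("run2.log") as b:
    for i, (line_a, line_b) in enumerate(zip(a, b)):
        if line_a != line_b:
            print(f"first divergence at line {i}:")
            print(f"  {line_a.rstrip()}")
            print(f"  {line_b.rstrip()}")
            break
    else:
        print("logs are identical (up to the shorter file)")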

  • Please set the learning rate to 0.0 and see what happens (a minimal snippet for this is sketched below).
  • Please try again with this repo unmodified, except for the random seed part.
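
For the first suggestion, a minimal way to force the learning rate to zero without editing the config (assuming a standard torch.optim optimizer object named optimizer):

# freeze the weights: with lr = 0.0, any remaining run-to-run difference must
# come from the forward pass or the data pipeline, not from optimization
for param_group in optimizer.param_groups:
    param_group["lr"] = 0.0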

hirotomusiker · Mar 22 '20 08:03

Hi @chengcchn, I want to know how you got the AP. I followed the author's instructions but couldn't evaluate the trained model.

Renascence6 · May 02 '20 13:05