PyTorch_YOLOv3

Confused about the train loss, size_average, and the performance

chengcchn opened this issue 4 years ago • 6 comments

Hi, @hirotomusiker. I come here again. As the title says, I am confused about the train loss, size_average, and the performance. I have trained both the original darknet repo and this repo on my own dataset (3 classes), and I want to share the results here. The params are the same: MAXITER: 6000, STEPS: (4800, 5400), IMGSIZE: 608 (both for train and test). With darknet, I got an mAP@0.5 of 79.0, and the final loss was 0.76 (avg). With this repo, the mAP@0.5 was 76.9, and the final loss was 4.7 (total). It seems that with this repo, the loss is harder to converge. So I changed the params for this repo (MAXITER: 8000, STEPS: (6400, 7200)) and got an mAP@0.5 of 78.3, with a final loss of 8.2 (total). So I have some questions:

  1. The performance seems different; could this be caused by the shuffling of the dataset?
  2. The loss of this repo is larger and harder to converge compared to darknet's. What's the reason?
  3. In #44, you talked about the param size_average and said that the loss of darknet is also high. How does that relate to the numbers above?

chengcchn · Mar 04 '20 11:03

  1. I cannot reproduce your training, but AP can change randomly if your dataset is not large enough and the training has not converged. I recommend plotting the val AP and making sure it has reached a plateau.
  2. The variation of loss values between iterations is large because the number of GT objects affects the loss.
  3. The logged loss of darknet (0.76 in your case) is a batch-summed loss. If the batch size is 64, darknet's logged loss is 64x higher than ours. The loss value is only for logging and does not affect the training performance (see the sketch below).
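
For intuition on point 3, here is a minimal, self-contained sketch (not code from either repo; the names are illustrative) showing that a batch-summed log value is exactly batch_size times larger than a batch-averaged one for the same per-image errors:

import torch

batch_size = 64
per_image_loss = torch.rand(batch_size)  # stand-in for any per-image loss values

batch_summed = per_image_loss.sum()      # darknet-style logging
batch_averaged = per_image_loss.mean()   # per-image-averaged logging

print(batch_summed / batch_averaged)     # tensor(64.) -- exactly batch_size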

hirotomusiker · Mar 04 '20 13:03

Hi, @hirotomusiker. Sorry for the late reply. I did as you said and got a good result. However, I found there is no setting for reproducibility, so I added the following seed setup before starting the training loop.

import random
import numpy as np
import torch

def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True

But I failed to get the same result between two runs. Any suggestions?
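
One thing setup_seed above does not cover is the data pipeline: if the training script uses a torch.utils.data.DataLoader with num_workers > 0, each worker process has its own numpy/random state. A hedged sketch of the standard PyTorch recipe (the seed_worker name and the commented loader line are illustrative, not from this repo):

import random
import numpy as np
import torch

def seed_worker(worker_id):
    # derive per-worker seeds from the torch base seed so augmentations are reproducible
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# also disable cuDNN autotuning, which can pick different kernels per run
torch.backends.cudnn.benchmark = False

# loader = torch.utils.data.DataLoader(dataset, batch_size=4,
#                                      num_workers=2, worker_init_fn=seed_worker)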

chengcchn · Mar 07 '20 07:03

Thank you, I've tried your seed setting and got the same loss results.

hirotomusiker · Mar 08 '20 03:03

Yes, in the first several epochs (like 100~200), the losses seem the same, but there are still slight differences if you look at the decimal places. And as the number of iterations increases, the loss difference becomes larger and larger, eventually leading to a difference in the mAP. I think this is due to randomness in the underlying PyTorch implementation, such as the CUDA implementation of the upsampling layer. Any suggestions?
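
Regarding the suspected CUDA nondeterminism: newer PyTorch versions (1.8+, not available when this thread was written) can surface exactly this. A hedged sketch; with it enabled, PyTorch raises a RuntimeError whenever an op without a deterministic CUDA implementation runs (some interpolation/upsampling backward kernels are known examples), instead of silently diverging:

import os
import torch

# must be set before any CUDA op; required by cuBLAS for determinism on CUDA >= 10.2
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# error out on any op that has no deterministic implementation
torch.use_deterministic_algorithms(True)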

chengcchn · Mar 16 '20 10:03

I have tried again and checked 40 iterations on COCO: 1st:

[Iter 0/500000] [lr 0.000000] [Losses: xy 43.622276, wh 16.042191, conf 67708.421875, cls 892.703674, total 25170.322266, imgsize 608]
[Iter 10/500000] [lr 0.000000] [Losses: xy 63.709991, wh 25.143564, conf 18768.097656, cls 1275.747925, total 7396.792969, imgsize 320]
[Iter 20/500000] [lr 0.000000] [Losses: xy 116.392715, wh 48.034309, conf 31668.382812, cls 2430.618652, total 12567.701172, imgsize 416]

2nd:

[Iter 0/500000] [lr 0.000000] [Losses: xy 43.622276, wh 16.042191, conf 67708.421875, cls 892.703674, total 25170.322266, imgsize 608]
[Iter 10/500000] [lr 0.000000] [Losses: xy 63.709991, wh 25.143564, conf 18768.097656, cls 1275.747925, total 7396.792969, imgsize 320]
[Iter 20/500000] [lr 0.000000] [Losses: xy 116.392715, wh 48.034309, conf 31668.382812, cls 2430.618652, total 12567.701172, imgsize 416]

The results are exactly the same.
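
If you want to check agreement beyond eyeballing, here is a minimal sketch for diffing two training logs line by line (the file names are hypothetical):

with open("run1.log") as a, open("run2.log") as b:
    for i, (line_a, line_b) in enumerate(zip(a, b)):
        if line_a != line_b:
            print(f"first divergence at line {i}:")
            print(f"  {line_a.rstrip()}")
            print(f"  {line_b.rstrip()}")
            break
    else:
        print("logs are identical (up to the shorter file)")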

  • Please set the learning rate to 0.0 and see what happens (a minimal snippet for this is sketched below).
  • Please try again with this repo unmodified, except for the random seed part.
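
For the first suggestion, a minimal way to force the learning rate to zero without editing the config (assuming a standard torch.optim optimizer object named optimizer):

# freeze the weights: with lr = 0.0, any remaining run-to-run difference must
# come from the forward pass or the data pipeline, not from optimization
for param_group in optimizer.param_groups:
    param_group["lr"] = 0.0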

hirotomusiker · Mar 22 '20 08:03

Hi @chengcchn, I want to know how you got the AP. I followed the author's instructions but couldn't evaluate the trained model.

Renascence6 · May 02 '20 13:05