PyTorch_YOLOv3
Confused about the train loss, `size_average`, and the performance
Hi, @hirotomusiker.
I come here again. As the title says, I am confused about the train loss, `size_average`, and the performance. I have trained both the original darknet repo and this repo on my own dataset (3 classes), and I want to share the results here.
The params are the same: MAXITER: 6000, STEPS: (4800, 5400), IMGSIZE: 608 (both for training and testing).
With darknet, I got an AP@0.5 of 79.0, and the final loss was 0.76 (avg).
With this repo, the AP@0.5 was 76.9, and the final loss was 4.7 (total).
It seems that with this repo the loss converges more slowly, so I changed the params for this repo (MAXITER: 8000, STEPS: (6400, 7200)) and got an AP@0.5 of 78.3, with a final loss of 8.2 (total).
So I have some questions.
- The performance seems different; could this be caused by the shuffling of the dataset?
- The loss of this repo is larger and converges more slowly compared to darknet. What's the reason?
- In #44, you talked about the param `size_average` and said that the loss of darknet is also high?
- I cannot reproduce your training, but AP can vary randomly if your dataset is not large enough and if the training has not converged. I recommend plotting the val AP and making sure it has reached a plateau.
- The variation of loss values between iterations is large because the number of GT objects affects the loss.
- The logged loss of darknet (0.76 in your case) is a batch-summed loss. If the batch size is 64, the darknet log-loss is 64x higher than ours (see the sketch after this list). The loss value is only for logging and does not affect training performance.
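For illustration only (this is not the repo's actual logging code, and `per_image_loss` is a made-up tensor): the same per-image losses get reported on scales that differ by the batch size, depending on whether the logger averages or sums over the batch.

```python
import torch

# Hypothetical per-image losses for a batch of 64 (illustrative values only).
per_image_loss = torch.rand(64)

batch_mean = per_image_loss.mean()  # size-averaged logging
batch_sum = per_image_loss.sum()    # batch-summed logging (darknet-style)

# Identical underlying losses, but the reported numbers differ by the
# batch size, so raw log values are not comparable across the two repos.
print(batch_sum / batch_mean)  # ~64.0
```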
Hi, @hirotomusiker. Sorry for the late reply. I did as you said and got a good result. However, I found there is no setting for reproducibility, so I added the seed setup before starting the training loop:
```python
import random
import numpy as np
import torch

def setup_seed(seed):
    # Seed all RNGs (PyTorch CPU/GPU, NumPy, Python) for reproducibility.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    # Force cuDNN to select deterministic kernels.
    torch.backends.cudnn.deterministic = True
```
But I failed to get the same result across runs. Any suggestions?
Thank you, I've tried your seed setting and got the same loss results.
Yes, in the first several epochs, like 100~200, the losses look the same, but they still differ slightly in the decimal places, as shown in the attached screenshots. And as the number of iterations increases, the loss difference becomes larger and larger, leading to a difference in the mAP.
I think this is due to randomness in the underlying implementation of PyTorch, such as the CUDA implementation of the upsample layer. Any suggestions?
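(As a version-dependent sketch, not part of this repo: newer PyTorch releases expose a global switch that forces deterministic kernels, or raises an error for ops, such as some interpolation backward kernels, that have no deterministic CUDA implementation. Whether it is available depends on the PyTorch version in use.)

```python
import os
import torch

# Required by cuBLAS for deterministic matmuls on CUDA >= 10.2;
# must be set before the first CUDA call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Force deterministic kernels; ops without a deterministic CUDA
# implementation raise a RuntimeError instead of silently varying
# between runs.
torch.use_deterministic_algorithms(True)
```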
I have tried again and checked 40 iterations on COCO.
1st:
```
[Iter 0/500000] [lr 0.000000] [Losses: xy 43.622276, wh 16.042191, conf 67708.421875, cls 892.703674, total 25170.322266, imgsize 608]
[Iter 10/500000] [lr 0.000000] [Losses: xy 63.709991, wh 25.143564, conf 18768.097656, cls 1275.747925, total 7396.792969, imgsize 320]
[Iter 20/500000] [lr 0.000000] [Losses: xy 116.392715, wh 48.034309, conf 31668.382812, cls 2430.618652, total 12567.701172, imgsize 416]
```
2nd:
```
[Iter 0/500000] [lr 0.000000] [Losses: xy 43.622276, wh 16.042191, conf 67708.421875, cls 892.703674, total 25170.322266, imgsize 608]
[Iter 10/500000] [lr 0.000000] [Losses: xy 63.709991, wh 25.143564, conf 18768.097656, cls 1275.747925, total 7396.792969, imgsize 320]
[Iter 20/500000] [lr 0.000000] [Losses: xy 116.392715, wh 48.034309, conf 31668.382812, cls 2430.618652, total 12567.701172, imgsize 416]
```
The results are exactly the same.
- Please set the learning rate to 0.0 and see what happens (see the sketch after this list).
- Please try it again with this repo, without any modification except the random-seed part.
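A minimal sketch of the zero-LR check, assuming a standard `torch.optim` optimizer (`optimizer` here is an illustrative name, not the repo's exact variable):

```python
# With lr = 0 the weights never change, so any remaining run-to-run
# difference in the logged losses must come from the forward pass or
# the data pipeline, not from the weight updates themselves.
for param_group in optimizer.param_groups:
    param_group["lr"] = 0.0
```

Note that an LR scheduler or burn-in running inside the training loop may overwrite this value each iteration, so setting the base LR to 0 in the config may be the safer route.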
Hi @chengcchn, I want to know how you got the AP. I followed the author's instructions but couldn't evaluate the trained model.