ScaledYOLOv4
Inconsistent conf_thres values in test.py and detect.py
In this repository, I see that the default value of conf_thres in test.py is 0.001, but in detect.py it is changed to 0.4. I also find that after training, my weight file achieves a really good mAP of 0.962 on my dataset when I run test.py with conf_thres at 0.001, but only 0.889 if I switch it to 0.4 during testing. However, if I actually want to detect a new picture, it performs fine using detect.py with conf_thres at 0.4, but outputs many wrong bounding boxes with low confidence when conf_thres is set to 0.001.

I guess this is what causes the inconsistency: the test process has the ground-truth boxes and matches each one to the predicted box with the highest GIoU, while the detect process does not know the ground truth and can only use a high conf_thres to filter out most of the wrong boxes at the same position. This means the excellent result from the test process cannot be reproduced in the detect process, or in real-world applications where the ground truth is unknown.

I want to confirm whether the best mAP on the COCO dataset reported in the paper is achieved with conf_thres at 0.001 or some other value. @WongKinYiu I also noticed this phenomenon in the YOLOv3 and YOLOv5 code offered by Ultralytics, and I'm really confused by it. Maybe I should train another weight that performs best when conf_thres is 0.4, or maybe I'll study this more and try to figure out other solutions. Thanks for your help!
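To make the question concrete, here is a minimal sketch (not the repository's actual code) of where conf_thres acts: raw predictions are filtered by their confidence score before NMS, so a near-zero threshold keeps many weak boxes for evaluation while a high threshold discards them for deployment. The `[x, y, w, h, conf, cls]` layout below is an assumption for illustration.

```python
import numpy as np

def filter_by_confidence(pred, conf_thres):
    """Keep only predictions whose confidence meets the threshold.

    pred: (N, 6) array laid out as [x, y, w, h, conf, cls] (hypothetical layout).
    """
    keep = pred[:, 4] >= conf_thres
    return pred[keep]

# Toy example: three boxes with confidences 0.002, 0.35, 0.91.
pred = np.array([
    [10, 10, 20, 20, 0.002, 0],
    [12, 11, 21, 19, 0.35,  0],
    [50, 50, 30, 30, 0.91,  1],
])
print(len(filter_by_confidence(pred, 0.001)))  # 3 boxes survive -> fed into mAP evaluation
print(len(filter_by_confidence(pred, 0.4)))    # 1 box survives -> what you see in detect.py
```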
Hello, the definition of AP is the area under the PR curve, so by definition it needs to integrate over confidence scores from 0 to 1. Using 0.001 is because the computer is a discrete system, and we ignore confidence scores between 0 and 0.001 (which has almost no effect on AP) to speed up the AP calculation.
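A minimal sketch of that definition, assuming per-class arrays `confidences` and `is_tp` (1 if the detection matched a ground-truth box) plus the ground-truth count; it uses a simplified trapezoidal integral rather than the interpolated precision used by real evaluators:

```python
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """Area under the PR curve obtained by sweeping the threshold over every detection."""
    order = np.argsort(-np.asarray(confidences))   # descending confidence
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Simplified all-point trapezoidal approximation of the area under the PR curve.
    return np.trapz(precision, recall)

# Dropping detections below a high conf_thres removes the low-confidence tail
# of this curve, so recall saturates early and the measured AP drops.
```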
For real-world inference, we usually care about a particular recall or precision (or both), not AP. For example, we often use the confidence score that gives an equal error rate (precision == recall), a specific precision target (P >= 0.995), a specific recall target (R >= 0.995), a specific F-score (F-score >= 0.99), etc.
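A minimal sketch of picking a deployment threshold from the PR curve under those criteria; `precision`, `recall`, and `conf` are assumed to be per-threshold arrays ordered by decreasing confidence, such as an evaluation script might produce:

```python
import numpy as np

def best_f_score_threshold(precision, recall, conf):
    """Return the confidence cut-off that maximises F1, plus the F1 value."""
    f1 = 2 * precision * recall / (precision + recall + 1e-16)
    i = int(np.argmax(f1))
    return conf[i], f1[i]

def threshold_for_precision(precision, conf, target=0.995):
    """Lowest cut-off (highest recall) still meeting the precision target, if any.

    Assumes the arrays are ordered by decreasing confidence threshold.
    """
    ok = np.where(precision >= target)[0]
    return conf[ok[-1]] if len(ok) else None
```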
Yes, you could set the AP calculation to use 0.4 or some other value in the training process, but then you will lose the PR information needed to choose the best confidence score for your case using the criteria mentioned above.
Since the full PR curve and AP calculation give us more to work with when choosing the best hyper-parameters for the final deployed model, I suggest you add multiple outputs, such as AP, AP at 0.4 conf, AP at 0.5 conf, ..., if you need that information during training.
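A minimal sketch of that suggestion, where `evaluate` is a hypothetical stand-in for the repository's validation routine called once per cut-off:

```python
def validation_report(evaluate, thresholds=(0.001, 0.4, 0.5)):
    """Run the same evaluation at several fixed confidence cut-offs and collect the APs."""
    report = {}
    for t in thresholds:
        report[f"AP@conf={t}"] = evaluate(conf_thres=t)
    return report
```

Logging all of these during training lets you watch both the full-curve AP (for comparability with the paper) and the AP at the threshold you actually plan to deploy with.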
Thanks so much for your help!!! Now I understand why! That's a really good suggestion, I'll try it now!