[URGENT] RetinaNet mAP is 37.50% on main, 37.55% on r2.1 branch
According to the README in the main branch, the reference fp32 accuracy of RetinaNet is 37.5% (i.e. 37.50%). With this reference value, the 99% threshold for valid submissions is 37.50% × 0.99 = 37.125%.
According to the README in the r2.1 branch, the reference fp32 accuracy of RetinaNet is 37.55%. With this reference value, the 99% threshold for valid submissions is 37.55% × 0.99 ≈ 37.175%.
We believe that 37.50% is the value agreed upon by the MLPerf Inference WG. Therefore, the 99% threshold must be taken as 37.125% for this round.
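For concreteness, the two competing thresholds work out as follows (a small sketch; the variable names are ours, not from any MLPerf tool):

```python
# 99% accuracy thresholds implied by the two competing reference values.
ref_main = 37.50  # RetinaNet fp32 mAP (%) per the README on the main branch
ref_r21 = 37.55   # RetinaNet fp32 mAP (%) per the README on the r2.1 branch

threshold_main = ref_main * 0.99  # 37.125%
threshold_r21 = ref_r21 * 0.99    # 37.1745%, i.e. ~37.175%

print(f"main threshold: {threshold_main:.4f}%")  # prints 37.1250%
print(f"r2.1 threshold: {threshold_r21:.4f}%")   # prints 37.1745%
```

A submission measuring, say, 37.15% mAP would pass under the main-branch reference but fail under the r2.1 one, which is exactly the disagreement raised here.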
Taken from the latest meeting slides:

- Accuracy target: FP32, 0.375 mAP
- Latency target: WG approves 100 msec
Copy of 2022.08.02 MLCommons Inference WG Meeting - RetinaNet.pdf
@psyhtest Accuracy and latency targets should be referred to the Inference rules: https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#411-constraints-for-the-closed-division. I am not sure it is appropriate to change the rules this late without discussion. @tjablin Can you comment?
The Readme was updated in PR1168 on 6/29. The PR says, "A decimal place for the new object detection model was missing in the README and in the submission checker. I missed it also in zenodo, so I added it there as well." It sounds like the accuracy was truncated instead of rounded. The MLPerf Inference Rules say, "99% of FP32 (0.3755 mAP)" as of PR251 on 7/21. PR251 says the rules are being updated to match the submission checker. The submission checker was updated in PR1168 on 6/29.
Sorry, I missed PR1168 (the submission checker change) five weeks ago. But the rules change (PR251) was made only two weeks ago.
According to the attached slide from the WG meeting slide deck 4 weeks ago, the rule freeze should have happened 10-11 weeks ago and code freeze 8-9 weeks ago:
| Week | Date | Milestone |
| --- | --- | --- |
| week -11 | 05/20/2022 | Power rule freeze, power tool freeze |
| week -11 | 05/20/2022 | Inference rule freeze, non-model / measurement methodology freeze |
| week -9 | 06/03/2022 | Code freeze (functionality freeze), includes automated submission checker |
Isn't there a contradiction?
We went by what we saw in the slides. We did miss the change made as late as 7/21 (https://github.com/mlcommons/inference_policies/pull/251), and we ran into the submission checker issue just before making our submission.
What are the options for making a submission with target FP32 mAP = 0.375?
We cannot redo the accuracy runs 24 hours before the submission deadline, as we have several submission tasks ongoing at this stage.
We need a waiver ASAP for this submission.
Where do we draw a line between cases like this and (non-WG approved) late changes like https://github.com/mlcommons/inference/pull/1206?
This is the accuracy we got with the reference implementation (fp32) using the onnxruntime backend. Not sure why, but the mAP is lower than 37.575% (we got 37.572%):
```
Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.376
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.525
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.406
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.025
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.127
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.420
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.598
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.627
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.082
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.341
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.677
```

mAP = 37.572%
Thank you, Arjun. Your accuracy is still higher than 37.55%.
My bad. Sorry, I was careless and misread the decimal points. Below are the official figures as in the r2.1 branch:
- PyTorch model: 0.3755
- ONNX model: 0.3757 (0.37572 reproduced by us)
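Under either reference value, the reproduced ONNX accuracy clears the 99% bar; a quick sketch of that check (the function name is ours, illustrative only):

```python
# Hedged sketch: check a measured mAP against the 99%-of-FP32 rule
# for each candidate reference value mentioned in this thread.
def passes_99(measured_map, reference_map):
    """True if the measured accuracy meets 99% of the FP32 reference."""
    return measured_map >= 0.99 * reference_map

measured = 0.37572  # ONNX fp32 run reproduced above
for name, ref in [("PyTorch", 0.3755), ("ONNX", 0.3757), ("README main", 0.375)]:
    print(f"{name} reference {ref}: pass = {passes_99(measured, ref)}")
```

The open question is which reference the submission checker should use, not whether this particular fp32 run passes.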
The Inference Policies state the accuracy requirement as 99% of the FP32 model, but they do not clearly say which exact model (the lowest-accuracy one?) nor how many digits after the decimal point to use. If submitters are to follow the exact figure given there, it is bad that the master branch still lists 37.5 rather than 37.55 and 37.57 as in the r2.1 branch. r2.1 is the branch used for submission, but the master branch should not carry wrong or misleading information. I hope such issues are avoided going forward when changes are made.
> Inference Policies say the accuracy requirement as 99% of FP32 model, but does not clearly say which exact model (lowest accuracy one?) nor say up to how many digits after the decimal point
Actually, one part of the rules says:
> Accuracy results must be reported to five significant figures with round to even. For example, 98.9995% should be recorded as 99.000%.
So neither 37.5% (three significant figures) nor 37.55% (four significant figures) is correctly stated; to five significant figures these would be 37.500% and 37.550% respectively.
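The five-significant-figure, round-half-to-even reporting rule quoted above can be sketched with Python's `decimal` module (the function name is ours; this is an illustration of the rounding convention, not the actual submission checker code):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def to_five_sig_figs(value_pct):
    """Format an accuracy percentage to five significant figures,
    using round-half-to-even, per the quoted rule (illustrative sketch)."""
    d = Decimal(str(value_pct))
    exp = d.adjusted()                     # exponent of the leading digit
    quantum = Decimal(1).scaleb(exp - 4)   # keep exactly 5 significant digits
    return str(d.quantize(quantum, rounding=ROUND_HALF_EVEN))

print(to_five_sig_figs("98.9995"))  # -> 99.000 (half rounds to the even digit)
print(to_five_sig_figs("37.5"))     # -> 37.500
print(to_five_sig_figs("37.572"))   # -> 37.572
```

Using `Decimal` rather than binary floats matters here: 98.9995 is exactly representable in decimal, so the tie-breaking rule applies as the rules intend.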
This issue was resolved during the v2.1 result review meetings.