
[URGENT] RetinaNet mAP is 37.50% on main, 37.55% on r2.1 branch

Open psyhtest opened this issue 2 years ago • 10 comments

According to the README in the main branch, the reference fp32 accuracy of RetinaNet is 37.5% (i.e. 37.50%). With this reference value, the 99% threshold for valid submissions is 37.50% * 0.99 = 37.125%.

According to the README in the r2.1 branch, the reference fp32 accuracy of RetinaNet is 37.55%. With this reference value, the 99% threshold for valid submissions is 37.55% * 0.99 = 37.1745%.

We believe that 37.50% is what has been agreed upon by the MLPerf Inference WG. Therefore, the 99% threshold must be taken as 37.125% for this round.
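
For concreteness, here is a quick sketch (plain Python, illustration only, not the submission checker itself) of the thresholds each candidate reference value implies:

```python
# Illustration only: 99% accuracy thresholds implied by the two reference values.
reference_map = {"main README": 37.50, "r2.1 README": 37.55}

for source, ref in reference_map.items():
    print(f"{source}: reference {ref:.2f}% -> 99% threshold {ref * 0.99:.4f}%")

# main README: reference 37.50% -> 99% threshold 37.1250%
# r2.1 README: reference 37.55% -> 99% threshold 37.1745%
```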

psyhtest avatar Aug 04 '22 11:08 psyhtest

Taken from the latest meeting slides:

- Accuracy target: FP32, 0.375 mAP
- Latency target: WG approves 100 msec

Copy of 2022.08.02 MLCommons Inference WG Meeting - RetinaNet.pdf

psyhtest avatar Aug 04 '22 12:08 psyhtest

@psyhtest Accuracy and latency targets should be taken from the Inference rules: https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#411-constraints-for-the-closed-division. I am not sure it is appropriate to change the rules this late without discussion. @tjablin, can you comment?

rnaidu02 avatar Aug 04 '22 14:08 rnaidu02

The README was updated in PR1168 on 6/29. The PR says, "A decimal place for the new object detection model was missing in the README and in the submission checker. I missed it also in zenodo, so I added it there as well." It sounds like the accuracy was truncated instead of rounded. The MLPerf Inference Rules say, "99% of FP32 (0.3755 mAP)" as of PR251 on 7/21. PR251 says the rules are being updated to match the submission checker. The submission checker itself was updated in PR1168 on 6/29.
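
The difference between truncating and rounding the r2.1 value is exactly one unit in the last displayed digit; a minimal sketch (illustration only, not the actual README/checker change):

```python
# Illustration only: truncating vs. rounding 0.3755 to three decimal places.
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

ref_map = Decimal("0.3755")  # PyTorch reference model mAP per the r2.1 README

truncated = ref_map.quantize(Decimal("0.001"), rounding=ROUND_DOWN)      # 0.375
rounded = ref_map.quantize(Decimal("0.001"), rounding=ROUND_HALF_EVEN)   # 0.376

print(truncated, rounded)  # 0.375 0.376
```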

tjablin avatar Aug 04 '22 16:08 tjablin

Sorry, I missed PR1168 (the submission checker change) 5 weeks ago. But the rules change, PR251, was only two weeks ago.

According to the attached slide from the WG meeting slide deck 4 weeks ago, the rule freeze should have happened 10-11 weeks ago and code freeze 8-9 weeks ago:

| Week | Date | Milestone |
| --- | --- | --- |
| week -11 | 05/20/2022 | Power rule freeze, power tool freeze |
| week -11 | 05/20/2022 | Inference rule freeze, non-model / measurement methodology freeze |
| week -9 | 06/03/2022 | Code freeze (functionality freeze); includes automated submission checker |

Isn't there a contradiction?

psyhtest avatar Aug 04 '22 16:08 psyhtest

We went by what we saw in the slides. We did miss the change made as late as 7/21 (https://github.com/mlcommons/inference_policies/pull/251) and ran into the submission checker issue just before making the submission.

What are the options for making a submission with target FP32 mAP = 0.375?

We can't spend the 24 hours before the submission deadline redoing the accuracy runs, as we have several submission tasks ongoing at this stage.

We need a waiver ASAP for this submission.

nitinqti avatar Aug 04 '22 16:08 nitinqti

Where do we draw the line between cases like this and (non-WG-approved) late changes like https://github.com/mlcommons/inference/pull/1206?

psyhtest avatar Aug 04 '22 16:08 psyhtest

This is the accuracy we got with the reference implementation (fp32) using the onnxruntime backend. Not sure why, but the mAP is lower than 37.575% (we got 37.572%):

```
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.376
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.406
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.025
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.127
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.420
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.598
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.627
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.082
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.341
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.677
mAP=37.572%
```
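
Output in this format typically comes from pycocotools' COCOeval; below is a minimal sketch of how an mAP figure like the one above can be produced, with hypothetical file paths (the reference app evaluates OpenImages annotations stored in COCO format, and the reported mAP corresponds to `stats[0]`):

```python
# Minimal sketch (hypothetical paths): computing COCO-style mAP with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("openimages-val-annotations.json")  # ground truth (COCO-format JSON, hypothetical name)
coco_dt = coco_gt.loadRes("detections.json")       # detections produced by the benchmark run

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints the AP/AR table shown above

# stats[0] is AP @ IoU=0.50:0.95, area=all, maxDets=100
print(f"mAP={ev.stats[0] * 100:.3f}%")
```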

arjunsuresh avatar Aug 05 '22 08:08 arjunsuresh

Thank you Arjun. Your accuracy is still higher than 37.55%.

psyhtest avatar Aug 06 '22 17:08 psyhtest

My bad, sorry: I was careless and misread the decimal points. Below are the official figures from the r2.1 branch:

  1. PyTorch model: 0.3755
  2. ONNX model: 0.3757 (0.37572 reproduced by us)

The Inference Policies state the accuracy requirement as 99% of the FP32 model, but they do not clearly say which exact model it refers to (the lowest-accuracy one?) or how many digits after the decimal point to use. If submitters are to follow the exact figure given there, it is bad that the master branch still says 37.5 rather than 37.55 and 37.57 as in the r2.1 branch. r2.1 is the branch used for submission, but the master branch should not carry wrong or misleading information. I hope such issues are avoided going forward when changes are made.

arjunsuresh avatar Aug 06 '22 18:08 arjunsuresh

> The Inference Policies state the accuracy requirement as 99% of the FP32 model, but they do not clearly say which exact model it refers to (the lowest-accuracy one?) or how many digits after the decimal point to use.

Actually, one part of the rules says:

> Accuracy results must be reported to five significant figures with round to even. For example, 98.9995% should be recorded as 99.000%.

So neither 37.5% (3 significant digits) nor 37.55% (4 significant digits) is correct.
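
For reference, a minimal sketch of what "five significant figures with round to even" gives for the numbers in this thread (illustration only, not rules text):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def to_five_sig_figs(value_str: str) -> Decimal:
    """Round a percentage (given as a string) to five significant figures, ties to even."""
    d = Decimal(value_str)
    # adjusted() is the exponent of the most significant digit; keep five digits from there.
    quantum = Decimal(1).scaleb(d.adjusted() - 4)
    return d.quantize(quantum, rounding=ROUND_HALF_EVEN)

print(to_five_sig_figs("98.9995"))  # 99.000 (the example from the rules)
print(to_five_sig_figs("37.572"))   # 37.572
print(to_five_sig_figs("37.5"))     # 37.500
```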

psyhtest avatar Aug 16 '22 18:08 psyhtest

This issue was resolved during the v2.1 result review meetings.

rnaidu02 avatar Oct 11 '22 23:10 rnaidu02