Latest OpenImages dataset annotations differ from those generated by the previous fiftyone-based script
For some images, the number of annotations generated using the latest script differs from that generated using the fiftyone Python package. A couple of examples are attached. The new script seems to produce a subset? We've also seen a difference in annotations when they are generated on two separate machines; we have yet to compare the environments of those two machines.
Anyone else experiencing this?
@G4V Is the end result a poor accuracy number?
I'm not sure if it is related, but I had used the new script (never tried with the old one) and then did inference using the Nvidia submission code and saw this issue. @nv-ananjappa
The scripts were indeed checked and verified, at least for the reference implementation (I think I had also tested this). But since the scripts are a cosmetic change (the dataset remaining the same), if it is causing problems, you can always use the old one from the 2.1 submission round, right?
@G4V What command are you using to download the dataset? Make sure you use the command ./openimages_mlperf -d <DOWNLOAD_PATH>, leaving the -m argument as None. That argument was only added for testing/development purposes (if you wanted to test the benchmark with a smaller subset).
Hi @pgmpablo157321,
I'm able to generate the entire dataset; it's just that some of the images have a different number of detections compared with the annotations generated by the fiftyone package, and this is throwing out our accuracy.
In our scripts we're launching it as you describe -
https://github.com/krai/ck-mlperf/blob/master/package/dataset-openimages-for-object-detection/install.sh#L18
The annotations for 1366cde3b480a15c.jpg highlight the problem -
Fiftyone generates four boxes, but the new script only one. Also, when running on two different machines, the single boxes differ. All very odd.
Could you check the annotations you're generating for this image?
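For reference, this is roughly how we're pulling out the boxes for that image from the generated JSON (a quick sketch; the path is just where our script happens to put the file):

```python
import json

# Path is an assumption -- point it at whatever JSON your download script produced.
with open("annotations/openimages-mlperf.json") as f:
    coco = json.load(f)

# Find the image record for the file in question.
image = next(i for i in coco["images"] if i["file_name"] == "1366cde3b480a15c.jpg")
print(image)

# Print every box annotated against that image id.
boxes = [a for a in coco["annotations"] if a["image_id"] == image["id"]]
print(len(boxes), "boxes")
for box in boxes:
    print(box)
```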
Just adding a data point here. This is on a MacBook Pro M1 system using the onnxruntime backend over the entire dataset with the reference implementation. We see an accuracy of 36.650, whereas the official number is 37.57. Not sure whether it is due to being a different system.
DONE (t=82.59s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.367
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.512
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.394
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.024
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.113
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.404
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.421
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.596
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.626
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.083
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.340
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.673
mAP=36.650%
@G4V If you are using this script for preprocessing, can you please try with threads=1 here?
Thanks @arjunsuresh. That's the order of reduction in accuracy we're experiencing using the updated script. I'll give threads=1 a try. Are you seeing an improvement with this?
@G4V If you are using this script for preprocessing, can you please try with threads=1 here?
Ah, the above is the script to generate the pre-processed images. The issue we're seeing is with the step before that, which generates the annotations that sit alongside the original images. Script here -
https://github.com/mlcommons/inference/blob/master/vision/classification_and_detection/tools/openimages.py
You're welcome Gavin. But please ignore my suggestion of threads=1 - I see that threads is not used in that function anyway; it is just there to keep compatibility with the imagenet script. OpenImages preprocessing is serial, whereas imagenet preprocessing is done in parallel in the reference script. It takes around 8 hours for the full accuracy run on M1, so I cannot try things easily either.
@G4V Thank you for clarifying. If it is not a big concern, please use the old script for your submissions. This doesn't look like an easy fix (accuracy issues usually take time).
@pgmpablo157321 Just to be sure, when the scripts were updated did we check accuracy on the entire dataset? We have been testing retinanet a lot ourselves, but always using a reduced dataset (which was one of the options that came with the modification).
@G4V But this can potentially help - using num_processes=1 here. If you have a fast GPU, this can be a quick test.
https://github.com/mlcommons/inference/blob/master/vision/classification_and_detection/tools/openimages.py#L88
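To be clear on why that could matter: this is not the actual openimages.py code, just a generic illustration of one way multiple worker processes can change behaviour from run to run (the order results come back in), which only matters if anything downstream assumes a fixed ordering:

```python
import random
import time
from multiprocessing import Pool


def work(i):
    # Simulate variable per-image processing time.
    time.sleep(random.random() / 100)
    return i


if __name__ == "__main__":
    with Pool(processes=4) as p:
        # With several workers, completion order varies from run to run.
        print(list(p.imap_unordered(work, range(10))))
    with Pool(processes=1) as p:
        # With a single worker, results come back in submission order.
        print(list(p.imap_unordered(work, range(10))))
```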
@pgmpablo157321 Just to be sure, when the scripts were updated did we check accuracy on the entire dataset? We have been testing retinanet a lot ourselves, but always using a reduced dataset (which was one of the options that came with the modification).
@arjunsuresh Yes, that is correct, I tested the reference implementation and it had the same accuracy. I'll run the benchmark accuracy again just to be sure.
The annotations for 1366cde3b480a15c.jpg highlight the problem
Fiftyone generates four boxes, but the new script only one. Also, when running on two different machines, the single boxes differ. All very odd.
Could you check the annotations you're generating for this image?
@G4V This is what I get using the current (3.0) script:
Image info:
{'id': 6479, 'file_name': '1366cde3b480a15c.jpg', 'height': 4320, 'width': 2432, 'license': None, 'coco_url': None}, {'id': 6480, 'file_name': '13690841e89135f7.jpg', 'height': 1024, 'width': 925, 'license': None, 'coco_url': None}
Boxes info:
{'id': 13704, 'image_id': 6479, 'category_id': 117, 'bbox': [1268.7263436800001, 260.2409688, 883.16528384, 3580.9157160000004], 'area': 3162540.444728257, 'iscrowd': 0, 'IsOccluded': 0, 'IsInside': 0, 'IsDepiction': 1, 'IsTruncated': 0, 'IsGroupOf': 1}
{'id': 25159, 'image_id': 6479, 'category_id': 148, 'bbox': [978.73172096, 249.83132400000002, 1150.0920704, 3632.96394], 'area': 4178243.0194431418, 'iscrowd': 0, 'IsOccluded': 1, 'IsInside': 0, 'IsDepiction': 0, 'IsTruncated': 0, 'IsGroupOf': 0}
{'id': 41129, 'image_id': 6479, 'category_id': 125, 'bbox': [1384.0650624000002, 1020.1445424, 207.60962559999984, 853.5904416000001], 'area': 177213.59199631453, 'iscrowd': 0, 'IsOccluded': 1, 'IsInside': 0, 'IsDepiction': 0, 'IsTruncated': 0, 'IsGroupOf': 0}
{'id': 41130, 'image_id': 6479, 'category_id': 125, 'bbox': [1430.2005887999999, 2727.325296, 177.9511424000001, 905.6384928000002], 'area': 161159.4043951743, 'iscrowd': 0, 'IsOccluded': 1, 'IsInside': 0, 'IsDepiction': 0, 'IsTruncated': 0, 'IsGroupOf': 0}
I get 4 different boxes with the 3.0 script. It assigns the image to id 6479 and there are four boxes that belong to this image_id.
@pgmpablo157321 Thank you for confirming. Is it that the box ids for an image are adjacent for the old script but they are not necessarily so for the new one? I'm also seeing 4 boxes for image_id=6479 for the current script.
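A quick way to check, for reference (a rough sketch against the COCO-style JSON; the path is whatever your script produced), is to see whether each image's box ids form a contiguous run:

```python
import json
from collections import defaultdict

with open("annotations/openimages-mlperf.json") as f:  # path is an assumption
    coco = json.load(f)

# Collect the annotation ids belonging to each image.
ids_per_image = defaultdict(list)
for a in coco["annotations"]:
    ids_per_image[a["image_id"]].append(a["id"])

# An image's box ids are "adjacent" if they form a contiguous run of integers.
non_adjacent = [img for img, ids in ids_per_image.items()
                if max(ids) - min(ids) + 1 != len(ids)]
print(len(non_adjacent), "images with non-adjacent box ids")
```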
yes, I see that the 4 boxes are not adjacent
Ah, ok, that's what threw me. I'll need to dig a bit further into why we're seeing the difference in accuracy measurement.
@arjunsuresh any ideas from your side on this?
Nothing clicking as of now -- need a sleep :) But since @pgmpablo157321 confirmed that he got the expected accuracy and I'm getting lower accuracy on aarch64 (I'll try a run on x86 overnight) using the same reference implementation, we can conclude that the issue has nothing to do with any internal preprocessing you might be using. It could be an architecture difference (less likely) or some Python dependency version change. If this were resnet50 I could have tried all the possibilities easily due to the short runtime. Here, I'll see if we can replicate the issue on a small dataset (6-7 hours for a single run is not feasible), and if so, in a day or two I should be able to report the culprit.
Also, sorting the annotations based on image_id might be a solution right?
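i.e. something along these lines (an untested sketch; the file names are placeholders):

```python
import json

with open("openimages-mlperf.json") as f:  # placeholder path
    coco = json.load(f)

# Re-order the annotation entries so that the boxes for each image are contiguous.
coco["annotations"].sort(key=lambda a: a["image_id"])

with open("openimages-mlperf-sorted.json", "w") as f:
    json.dump(coco, f)
```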
Thanks @arjunsuresh. The only difference for us between accuracy calcs is the annotations file (I think). Shall dig further. Sorting the annotations will give another good data point.
Just ran a couple of tests and I see there is a very small difference between the two sets. For some reason, either this implementation or the previous one swaps the dimensions of the image 1366cde3b480a15c.jpg. However, this should be negligible for the metric since it only affects 4 boxes out of 158642.
Specifically, what I did was:
- Sort the annotations by image_id (in the current implementation; the previous one was already sorted)
- Iterate over the images and check if they have the same height and width
- Iterate over the annotations, group them by image_id and compare the obtained groups
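Roughly along these lines (a sketch of the comparison; the file names are placeholders):

```python
import json
from collections import defaultdict


def load(path):
    with open(path) as f:
        return json.load(f)


old = load("openimages-mlperf-2.1.json")   # placeholder file names
new = load("openimages-mlperf-3.0.json")

# 1. Sort the new annotations by image_id (the old file is already sorted).
new["annotations"].sort(key=lambda a: a["image_id"])

# 2. Compare image dimensions by file name.
old_dims = {i["file_name"]: (i["height"], i["width"]) for i in old["images"]}
new_dims = {i["file_name"]: (i["height"], i["width"]) for i in new["images"]}
for name in old_dims:
    if old_dims[name] != new_dims.get(name):
        print("dimension mismatch:", name, old_dims[name], new_dims.get(name))

# 3. Group the boxes by image and compare the obtained groups.
def groups(coco):
    id_to_name = {i["id"]: i["file_name"] for i in coco["images"]}
    g = defaultdict(list)
    for a in coco["annotations"]:
        g[id_to_name[a["image_id"]]].append((a["category_id"], tuple(a["bbox"])))
    return {k: sorted(v) for k, v in g.items()}


old_groups, new_groups = groups(old), groups(new)
for name in old_groups:
    if old_groups[name] != new_groups.get(name):
        print("box mismatch:", name)
```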
@G4V how did you find that specific image?
Luck. I hadn't realised that the boxes for this specific image differed from those produced by the previous script, only that I thought the boxes were a subset, as they are not contiguously listed in the JSON.
Agree that all other boxes are the same barring those four. The accuracy issue is at our end, I think, but not yet confirmed.
@G4V you should try a lottery 😁
@G4V @arjunsuresh I got a reduction in accuracy as well:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.366
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.512
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.394
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.024
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.113
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.404
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.420
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.595
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.333
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.671
TestScenario.SingleStream qps=45.52, mean=0.1832, time=544.420, acc=40.478%, mAP=36.634%, queries=24781, tiles=50.0:0.1833,80.0:0.1881,90.0:0.1905,95.0:0.1924,99.0:0.1962,99.9:0.2082
Thanks @pgmpablo157321. So M1 gave a slightly better accuracy of 36.650%.
Do you know what exactly has changed since the last time you got 37.57?
TL;DR: fiftyone==0.16.5 mlperf-inference-source==2.1 gets things back in shape.
Rather unhelpfully, fiftyone introduced a new 0.19.0 release just a few days ago, which seems to break downloads even with the r2.1 branch. I think 0.18.0 should work too, as we had no download issues until February, but I've only tested 0.16.5 so far.
Thank you @psyhtest. And if we use the annotations file produced and then call this accuracy script, we can expect 37.57% mAP, right?
I made another two runs; these are the results. First, I ran the object detection benchmark with the Inference 3.0 annotations and the 2.1 code and got:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.366
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.512
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.394
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.024
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.113
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.404
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.420
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.595
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.333
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.671
TestScenario.SingleStream qps=43.34, mean=0.1822, time=571.734, acc=40.478%, mAP=36.634%, queries=24781, tiles=50.0:0.1824,80.0:0.1875,90.0:0.1900,95.0:0.1919,99.0:0.1958,99.9:0.2050
Then I ran the benchmark with Inference 2.1 annotations and 3.0 code:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.376
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.524
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.406
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.025
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.127
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.420
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.596
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.075
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.675
TestScenario.SingleStream qps=43.32, mean=0.1816, time=572.001, acc=40.478%, mAP=37.550%, queries=24781, tiles=50.0:0.1818,80.0:0.1868,90.0:0.1891,95.0:0.1910,99.0:0.1947,99.9:0.2018
So it seems the four boxes are responsible for the difference in mAP (I don't completely understand how). This issue should be solved for now by taking the annotations from this release.
@pgmpablo157321 That's useful information. Just to be sure, can we manually edit the annotations file from r2.1 to modify just the 4 boxes to match the annotations file of r3.0, and see what accuracy we get? This can tell us whether it is really the boxes or the different ordering that is causing the accuracy difference.
@arjunsuresh I think you can do that, but also keep in mind that the dimensions of the image 1366cde3b480a15c.jpg were swapped as well, so that might also affect the results.
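Something along these lines should do the edit (a rough sketch; the paths are placeholders), copying both the four boxes and the swapped height/width from the 3.0 file into the 2.1 file:

```python
import json


def load(path):
    with open(path) as f:
        return json.load(f)


old = load("openimages-mlperf-2.1.json")   # placeholder paths
new = load("openimages-mlperf-3.0.json")
NAME = "1366cde3b480a15c.jpg"

new_img = next(i for i in new["images"] if i["file_name"] == NAME)
old_img = next(i for i in old["images"] if i["file_name"] == NAME)

# Copy the (swapped) height/width from the 3.0 file.
old_img["height"], old_img["width"] = new_img["height"], new_img["width"]

# Replace the boxes for this image with the ones from the 3.0 file, remapping
# image_id since the two files number images differently. Note that this
# appends the patched boxes at the end rather than in their original position.
old["annotations"] = [a for a in old["annotations"] if a["image_id"] != old_img["id"]]
for a in new["annotations"]:
    if a["image_id"] == new_img["id"]:
        old["annotations"].append(dict(a, image_id=old_img["id"]))

with open("openimages-mlperf-2.1-patched.json", "w") as f:
    json.dump(old, f)
```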
@pgmpablo157321 I've run with the known good 2.1 annotations file but with the boxes and dimensions modified for 1366cde3b480a15c.jpg, and I'm not seeing a change in accuracy. Could you try this also and confirm that you see the same?
If so, and everything else being equal, this seems to imply that the accuracy calc is (erroneously) tied to the order of images in the annotations file?
@G4V I was thinking the same but could not try it, as I only just got a system. I could not find anything suspicious in the accuracy script, though it does have this written.
@pgmpablo157321 In the dataset download script, with a count option like -m 50 the script downloads 50 random images. Is there any reason to include this randomness? If not, can you please remove it, as then we can easily compare the accuracy of smaller dataset runs.
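For illustration only (this is not the actual download script), a deterministic alternative would be to either sort and slice, or keep the sampling but seed it:

```python
import random

# Placeholder for whatever list of image ids the download script builds.
image_ids = sorted(f"img_{i:04d}" for i in range(500))

# Option 1: deterministic subset -- the first N ids in sorted order.
subset = image_ids[:50]

# Option 2: keep the random sampling but fix the seed, so two machines get the same subset.
rng = random.Random(42)
subset = rng.sample(image_ids, 50)
```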
By replacing the annotations file we are also seeing the expected accuracy, but we're still not sure of the real reason for the problem.
TestScenario.Offline qps=158.26, mean=11.1613, time=156.587, acc=41.033%, mAP=37.572%, queries=24781, tiles=50.0:10.4530,80.0:14.5974,90.0:14.8929,95.0:15.0823,99.0:15.4117,99.9:24.6361
CM run command used:
cm run script --tags=generate-run-cmds --execution-mode=valid --model=retinanet \
--mode=accuracy --adr.openimages-preprocessed.tags=_full,_custom-annotations