Latest OpenImages dataset annotations differ from those generated by the previous fiftyone-based script
For some images, the number of annotations generated using the latest script differs from that generated using the fiftyone Python package. A couple of examples are attached. The new script seems to produce a subset? We've also seen a difference in annotations when they are generated on two separate machines; we have yet to compare the environments of those two machines.
Anyone else experiencing this?
@G4V Is the end result a poor accuracy number?
I'm not sure if it is related, but I had used the new script (never tried with the old one) and then did inference using the Nvidia submission code and saw this issue. @nv-ananjappa
The scripts were indeed checked and verified, at least for the reference implementation (I think I had also tested this). But since the scripts are a cosmetic change (the dataset remaining the same), if it is causing problems, you can always use the old one from the 2.1 submission round, right?
@G4V What command are you using to download the dataset? Make sure you use the command ./openimages_mlperf -d <DOWNLOAD_PATH>, leaving the -m argument as None. That argument was only added for testing/development purposes (if you wanted to test the benchmark with a smaller subset).
Hi @pgmpablo157321,
I'm able to generate the entire dataset; it's just that some of the images have a different number of detections compared with the annotations generated by the fiftyone package, and this is throwing out our accuracy.
In our scripts we're launching it as you describe -
https://github.com/krai/ck-mlperf/blob/master/package/dataset-openimages-for-object-detection/install.sh#L18
The annotations for 1366cde3b480a15c.jpg highlight the problem -
Fiftyone generates four boxes, but the new script only one. Also, when running on two different machines, the single boxes differ. All very odd.
Could you check the annotations you're generating for this image?
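For reference, this is roughly how we're pulling out the boxes for that image from the generated JSON (a quick sketch; the path is just where our script happens to put the file):

```python
import json

# Path is an assumption -- point it at whatever JSON your download script produced.
with open("annotations/openimages-mlperf.json") as f:
    coco = json.load(f)

# Find the image record for the file in question.
image = next(i for i in coco["images"] if i["file_name"] == "1366cde3b480a15c.jpg")
print(image)

# Print every box annotated against that image id.
boxes = [a for a in coco["annotations"] if a["image_id"] == image["id"]]
print(len(boxes), "boxes")
for box in boxes:
    print(box)
```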
Just adding a data point here. This is on a MacBook Pro M1 system using the onnxruntime backend over the entire dataset with the reference implementation. We see an accuracy of 36.650, whereas the official number is 37.57. Not sure whether it is due to being a different system.
DONE (t=82.59s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.367
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.512
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.394
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.024
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.113
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.404
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.421
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.596
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.626
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.083
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.340
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.673
mAP=36.650%
@G4V If you are using this script for preprocessing, can you please try with threads=1 here?
Thanks @arjunsuresh. That's the order of reduction in accuracy we're experiencing using the updated script. I'll give threads=1 a try. Are you seeing an improvement with this?
@G4V If you are using this script for preprocessing, can you please try with threads=1 here?
Ah, the above is the script to generate the pre-processed images. The issue we're seeing is with the step before that, which generates the annotations that sit alongside the original images. Script here -
https://github.com/mlcommons/inference/blob/master/vision/classification_and_detection/tools/openimages.py
You're welcome Gavin. But please ignore my suggestion of threads=1 - I see that threads is not used in that function anyway; it is just there to keep compatibility with the imagenet script. OpenImages preprocessing is serial, whereas imagenet preprocessing is done in parallel in the reference script. It takes around 8 hours for the full accuracy run on M1, so I cannot try things easily either.
@G4V Thank you for clarifying. If it is not a big concern, please use the old script for your submissions. This doesn't look like an easy fix (accuracy issues usually take time).
@pgmpablo157321 Just to be sure, when the scripts were updated did we check accuracy on the entire dataset? We have been testing retinanet a lot ourselves, but always using a reduced dataset (which was one of the options that came with the modification).
@G4V But this can potentially help - using num_processes=1 here. If you have a fast GPU, this can be a quick test.
https://github.com/mlcommons/inference/blob/master/vision/classification_and_detection/tools/openimages.py#L88
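To be clear on why that could matter: this is not the actual openimages.py code, just a generic illustration of one way multiple worker processes can change behaviour from run to run (the order results come back in), which only matters if anything downstream assumes a fixed ordering:

```python
import random
import time
from multiprocessing import Pool


def work(i):
    # Simulate variable per-image processing time.
    time.sleep(random.random() / 100)
    return i


if __name__ == "__main__":
    with Pool(processes=4) as p:
        # With several workers, completion order varies from run to run.
        print(list(p.imap_unordered(work, range(10))))
    with Pool(processes=1) as p:
        # With a single worker, results come back in submission order.
        print(list(p.imap_unordered(work, range(10))))
```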
@pgmpablo157321 Just to be sure, when the scripts were updated did we check accuracy on the entire dataset? We have been testing retinanet a lot ourselves, but always using a reduced dataset (which was one of the options that came with the modification).
@arjunsuresh Yes, that is correct, I tested the reference implementation and it had the same accuracy. I'll run the benchmark accuracy again just to be sure.
The annotations for 1366cde3b480a15c.jpg highlight the problem
Fiftyone generates four boxes, but the new script only one. Also, when running on two different machines, the single boxes differ. All very odd.
Could you check the annotations you're generating for this image?
@G4V This is what I get using the current (3.0) script:
Image info:
{'id': 6479, 'file_name': '1366cde3b480a15c.jpg', 'height': 4320, 'width': 2432, 'license': None, 'coco_url': None}, {'id': 6480, 'file_name': '13690841e89135f7.jpg', 'height': 1024, 'width': 925, 'license': None, 'coco_url': None}
Boxes info:
{'id': 13704, 'image_id': 6479, 'category_id': 117, 'bbox': [1268.7263436800001, 260.2409688, 883.16528384, 3580.9157160000004], 'area': 3162540.444728257, 'iscrowd': 0, 'IsOccluded': 0, 'IsInside': 0, 'IsDepiction': 1, 'IsTruncated': 0, 'IsGroupOf': 1}
{'id': 25159, 'image_id': 6479, 'category_id': 148, 'bbox': [978.73172096, 249.83132400000002, 1150.0920704, 3632.96394], 'area': 4178243.0194431418, 'iscrowd': 0, 'IsOccluded': 1, 'IsInside': 0, 'IsDepiction': 0, 'IsTruncated': 0, 'IsGroupOf': 0}
{'id': 41129, 'image_id': 6479, 'category_id': 125, 'bbox': [1384.0650624000002, 1020.1445424, 207.60962559999984, 853.5904416000001], 'area': 177213.59199631453, 'iscrowd': 0, 'IsOccluded': 1, 'IsInside': 0, 'IsDepiction': 0, 'IsTruncated': 0, 'IsGroupOf': 0}
{'id': 41130, 'image_id': 6479, 'category_id': 125, 'bbox': [1430.2005887999999, 2727.325296, 177.9511424000001, 905.6384928000002], 'area': 161159.4043951743, 'iscrowd': 0, 'IsOccluded': 1, 'IsInside': 0, 'IsDepiction': 0, 'IsTruncated': 0, 'IsGroupOf': 0}
I get 4 different boxes with the 3.0 script. It assigns the image to id 6479 and there are four boxes that belong to this image_id.
@pgmpablo157321 Thank you for confirming. Is it that the box ids for an image are adjacent for the old script but they are not necessarily so for the new one? I'm also seeing 4 boxes for image_id=6479 for the current script.
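A quick way to check, for reference (a rough sketch against the COCO-style JSON; the path is whatever your script produced), is to see whether each image's box ids form a contiguous run:

```python
import json
from collections import defaultdict

with open("annotations/openimages-mlperf.json") as f:  # path is an assumption
    coco = json.load(f)

# Collect the annotation ids belonging to each image.
ids_per_image = defaultdict(list)
for a in coco["annotations"]:
    ids_per_image[a["image_id"]].append(a["id"])

# An image's box ids are "adjacent" if they form a contiguous run of integers.
non_adjacent = [img for img, ids in ids_per_image.items()
                if max(ids) - min(ids) + 1 != len(ids)]
print(len(non_adjacent), "images with non-adjacent box ids")
```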
yes, I see that the 4 boxes are not adjacent
Ah, ok, that's what threw me. I'll need to dig a bit further into why we're seeing the difference in accuracy measurement.
@arjunsuresh any ideas from your side on this?
Nothing clicking as of now -- need a sleep :) But since @pgmpablo157321 confirmed that he got the expected accuracy and I'm getting lower accuracy on aarch64 (I'll try a run on x86 overnight) using the same reference implementation, we can conclude that the issue has nothing to do with any internal preprocessing you might be using. It could be an architecture difference (less likely) or some Python dependency version change. If this were resnet50 I could have tried all the possibilities easily due to the short runtime. Here, I'll see if we can replicate the issue on a small dataset (6-7 hours for a single run is not feasible), and if so, in a day or two I should be able to report the culprit.
Also, sorting the annotations based on image_id might be a solution right?
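i.e. something along these lines (an untested sketch; the file names are placeholders):

```python
import json

with open("openimages-mlperf.json") as f:  # placeholder path
    coco = json.load(f)

# Re-order the annotation entries so that the boxes for each image are contiguous.
coco["annotations"].sort(key=lambda a: a["image_id"])

with open("openimages-mlperf-sorted.json", "w") as f:
    json.dump(coco, f)
```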
Thanks @arjunsuresh. The only difference for us between accuracy calcs is the annotations file (I think). Shall dig further. Sorting the annotations will give another good data point.
Just ran a couple of tests and I see there is a very small difference between the two sets. For some reason, either this implementation or the previous one swaps the dimensions of the image 1366cde3b480a15c.jpg. However, this should be negligible for the metric since it only affects 4 boxes out of 158642.
Specifically, what I did was:
- Sort the annotations by image_id (in the current implementation; the previous one was already sorted)
- Iterate over the images and check if they have the same height and width
- Iterate over the annotations, group them by image_id and compare the obtained groups
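Roughly along these lines (a sketch of the comparison; the file names are placeholders):

```python
import json
from collections import defaultdict


def load(path):
    with open(path) as f:
        return json.load(f)


old = load("openimages-mlperf-2.1.json")   # placeholder file names
new = load("openimages-mlperf-3.0.json")

# 1. Sort the new annotations by image_id (the old file is already sorted).
new["annotations"].sort(key=lambda a: a["image_id"])

# 2. Compare image dimensions by file name.
old_dims = {i["file_name"]: (i["height"], i["width"]) for i in old["images"]}
new_dims = {i["file_name"]: (i["height"], i["width"]) for i in new["images"]}
for name in old_dims:
    if old_dims[name] != new_dims.get(name):
        print("dimension mismatch:", name, old_dims[name], new_dims.get(name))

# 3. Group the boxes by image and compare the obtained groups.
def groups(coco):
    id_to_name = {i["id"]: i["file_name"] for i in coco["images"]}
    g = defaultdict(list)
    for a in coco["annotations"]:
        g[id_to_name[a["image_id"]]].append((a["category_id"], tuple(a["bbox"])))
    return {k: sorted(v) for k, v in g.items()}


old_groups, new_groups = groups(old), groups(new)
for name in old_groups:
    if old_groups[name] != new_groups.get(name):
        print("box mismatch:", name)
```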
@G4V how did you find that specific image?
Luck. I hadn't realised that the boxes for this specific image differed from those produced by the previous script, only that I thought the boxes were a subset, as they are not contiguously listed in the JSON.
Agree that all other boxes are the same barring those four. The accuracy issue is at our end, I think, but not yet confirmed.
@G4V you should try a lottery 😁
@G4V @arjunsuresh I got a reduction in accuracy as well:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.366
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.512
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.394
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.024
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.113
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.404
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.420
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.595
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.333
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.671
TestScenario.SingleStream qps=45.52, mean=0.1832, time=544.420, acc=40.478%, mAP=36.634%, queries=24781, tiles=50.0:0.1833,80.0:0.1881,90.0:0.1905,95.0:0.1924,99.0:0.1962,99.9:0.2082
Thanks @pgmpablo157321. So M1 gave a slightly better accuracy of 36.650%.
Do you know what exactly has changed since the last time you got 37.57?
TL;DR: fiftyone==0.16.5 mlperf-inference-source==2.1 gets things back in shape.
Rather unhelpfully, fiftyone introduced a new 0.19.0 release just a few days ago, which seems to break downloads even with the r2.1 branch. I think 0.18.0 should work too, as we had no download issues until February, but I've only tested 0.16.5 so far.
Thank you @psyhtest. And if we use the annotations file produced and then call this accuracy script, we can expect 37.57% mAP, right?
I made another two runs; these are the results. First, I ran the object detection benchmark with the Inference 3.0 annotations and the 2.1 code and got:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.366
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.512
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.394
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.024
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.113
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.404
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.420
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.595
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.333
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.671
TestScenario.SingleStream qps=43.34, mean=0.1822, time=571.734, acc=40.478%, mAP=36.634%, queries=24781, tiles=50.0:0.1824,80.0:0.1875,90.0:0.1900,95.0:0.1919,99.0:0.1958,99.9:0.2050
Then I ran the benchmark with Inference 2.1 annotations and 3.0 code:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.376
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.524
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.406
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.025
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.127
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.420
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.596
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.623
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.075
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.675
TestScenario.SingleStream qps=43.32, mean=0.1816, time=572.001, acc=40.478%, mAP=37.550%, queries=24781, tiles=50.0:0.1818,80.0:0.1868,90.0:0.1891,95.0:0.1910,99.0:0.1947,99.9:0.2018
So it seems the four boxes are responsible for the difference in mAP (I don't completely understand how). This issue should be solved for now by taking the annotations from this release.
@pgmpablo157321 That's useful information. Just to be sure, can we manually edit the annotations file from r2.1 to modify just the 4 boxes to match the annotations file of r3.0, and see what accuracy we get? This can tell us whether it is really the boxes or the different ordering that is causing the accuracy difference.
@arjunsuresh I think you can do that, but also keep in mind that the dimensions of the image 1366cde3b480a15c.jpg were swapped as well, so that might also affect the results.
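Something along these lines should do the edit (a rough sketch; the paths are placeholders), copying both the four boxes and the swapped height/width from the 3.0 file into the 2.1 file:

```python
import json


def load(path):
    with open(path) as f:
        return json.load(f)


old = load("openimages-mlperf-2.1.json")   # placeholder paths
new = load("openimages-mlperf-3.0.json")
NAME = "1366cde3b480a15c.jpg"

new_img = next(i for i in new["images"] if i["file_name"] == NAME)
old_img = next(i for i in old["images"] if i["file_name"] == NAME)

# Copy the (swapped) height/width from the 3.0 file.
old_img["height"], old_img["width"] = new_img["height"], new_img["width"]

# Replace the boxes for this image with the ones from the 3.0 file, remapping
# image_id since the two files number images differently. Note that this
# appends the patched boxes at the end rather than in their original position.
old["annotations"] = [a for a in old["annotations"] if a["image_id"] != old_img["id"]]
for a in new["annotations"]:
    if a["image_id"] == new_img["id"]:
        old["annotations"].append(dict(a, image_id=old_img["id"]))

with open("openimages-mlperf-2.1-patched.json", "w") as f:
    json.dump(old, f)
```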
@pgmpablo157321 I've run with the known good 2.1 annotations file but with the boxes and dimensions modified for 1366cde3b480a15c.jpg, and I'm not seeing a change in accuracy. Could you try this also and confirm that you see the same?
If so, and everything else being equal, this seems to imply that the accuracy calc is (erroneously) tied to the order of images in the annotations file?
@G4V I was thinking the same but could not try it, as I only just got a system. I could not find anything suspicious in the accuracy script, though it does have this written.
@pgmpablo157321 In the dataset download script, with a count option like -m 50 the script downloads 50 random images. Is there any reason to include this randomness? If not, can you please remove it, as then we can easily compare the accuracy of smaller dataset runs.
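For illustration only (this is not the actual download script), a deterministic alternative would be to either sort and slice, or keep the sampling but seed it:

```python
import random

# Placeholder for whatever list of image ids the download script builds.
image_ids = sorted(f"img_{i:04d}" for i in range(500))

# Option 1: deterministic subset -- the first N ids in sorted order.
subset = image_ids[:50]

# Option 2: keep the random sampling but fix the seed, so two machines get the same subset.
rng = random.Random(42)
subset = rng.sample(image_ids, 50)
```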
By replacing the annotations file we are also seeing the expected accuracy, but we're still not sure of the real reason for the problem.
TestScenario.Offline qps=158.26, mean=11.1613, time=156.587, acc=41.033%, mAP=37.572%, queries=24781, tiles=50.0:10.4530,80.0:14.5974,90.0:14.8929,95.0:15.0823,99.0:15.4117,99.9:24.6361
CM run command used:
cm run script --tags=generate-run-cmds --execution-mode=valid --model=retinanet \
--mode=accuracy --adr.openimages-preprocessed.tags=_full,_custom-annotations