
Class id mismatch when working with COCO starting from class_id == 0.

Open MihailMihaylov97 opened this issue 2 years ago • 4 comments

Hi all, I have encountered strange behavior when working with a COCO-style dataset whose class ids start from 0. The problem appears when exporting a COCO-style dataset that starts from category_id == 0 (like the example below):

Input:

{
    "licenses": [],
    "info": {},
    "categories": [
        {
            "id": 0,
            "name": "A",
            "supercategory": ""
        },
        {
            "id": 1,
            "name": "B",
            "supercategory": ""
        }
        
    ],
    "images": [
        {
            "id": 1,
            "width": 512,
            "height": 512,
            "file_name": "test.png",
            "license": 0,
            "flickr_url": "",
            "coco_url": "",
            "date_captured": 0
        }
    ],
    "annotations": [

        {
            "id": 1,
            "image_id": 1,
            "category_id": 0,
            "segmentation": [],
            "area": 12965.0,
            "bbox": [
                173.0,
                142.0,
                156.0,
                109.0
            ],
            "iscrowd": 0
        }
    ]
}

The resulting COCO-style dataset has its category ids shifted by 1 (see the output below), while the annotation category ids are also shifted by 1, except for category id 0.

Output:

{
    "licenses": [],
    "info": {},
    "categories": [
        {
            "id": 1,
            "name": "A",
            "supercategory": ""
        },
        {
            "id": 2,
            "name": "B",
            "supercategory": ""
        }
        
    ],
    "images": [
        {
            "id": 1,
            "width": 512,
            "height": 512,
            "file_name": "test.png",
            "license": 0,
            "flickr_url": "",
            "coco_url": "",
            "date_captured": 0
        }
    ],
    "annotations": [

        {
            "id": 1,
            "image_id": 2,
            "category_id": 0,
            "segmentation": [],
            "area": 12965.0,
            "bbox": [
                173.0,
                142.0,
                156.0,
                109.0
            ],
            "iscrowd": 0
        }
    ]
}
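A quick way to confirm the shift is to compare the name-to-id mapping of the categories before and after export. The following is only a minimal sketch using the standard json module; the helper names category_ids_by_name and changed_category_ids are made up for illustration and are not part of Datumaro, and the two inline JSON strings stand in for the real input and output annotation files.

```python
import json


def category_ids_by_name(coco: dict) -> dict:
    """Map each category name to its id in a COCO-style dict."""
    return {c["name"]: c["id"] for c in coco.get("categories", [])}


def changed_category_ids(before: dict, after: dict) -> dict:
    """Return {name: (old_id, new_id)} for categories whose id differs."""
    old = category_ids_by_name(before)
    new = category_ids_by_name(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


# Stand-ins for the "categories" sections of the input and output files above.
before = json.loads('{"categories": [{"id": 0, "name": "A"}, {"id": 1, "name": "B"}]}')
after = json.loads('{"categories": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]}')

print(changed_category_ids(before, after))
```

An empty result would mean that every category kept its id across the export.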

Expected behavior: all class ids should remain unchanged. So in this case:

...
 "categories": [
        {
            "id": 0,
            "name": "A",
            "supercategory": ""
        },
        {
            "id": 1,
            "name": "B",
            "supercategory": ""
        }],
"annotations": [
        {
            "id": 1,
            "image_id": 2,
            "category_id": 0,
            "segmentation": [],
            "area": 12965.0,
            "bbox": [
                173.0,
                142.0,
                156.0,
                109.0
            ],
            "iscrowd": 0
        }
    ]

NB: I did not manage to find an option to leave label_ids unchanged. Any help would be much appreciated!

MihailMihaylov97, Sep 19 '22 16:09

Hi, currently Datumaro considers id = 0 as "no label", because this id is not used in the original dataset. There are no options in API or CLI to tweak this behavior. In the code it is quite simple to change, however. The implementation is here (writing) and here (reading). As a quick solution, I can suggest changing 0 labels to some other number in the annotation files.
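The quick solution of moving the 0 labels to another number could be scripted before import. This is only a sketch under assumptions: it shifts every category id by a fixed offset so that nothing uses id 0, it relies on the standard json module only, and the function name shift_category_ids is made up for illustration (it is not a Datumaro API). Whether a uniform shift is acceptable depends on the downstream consumers of the ids.

```python
import json


def shift_category_ids(coco: dict, offset: int = 1) -> dict:
    """Return a copy of a COCO-style dict with all category ids moved by `offset`."""
    shifted = json.loads(json.dumps(coco))  # deep copy via a JSON round trip
    for category in shifted.get("categories", []):
        category["id"] += offset
    for annotation in shifted.get("annotations", []):
        annotation["category_id"] += offset
    return shifted


# Reduced version of the annotation file from the report above.
sample = {
    "categories": [{"id": 0, "name": "A"}, {"id": 1, "name": "B"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 0}],
}

result = shift_category_ids(sample)
print(json.dumps(result, indent=4))
```

For a real dataset, the dict would be loaded with json.load from the annotation file and written back with json.dump; the original ids can be recovered later by applying the negative offset.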

zhiltsov-max, Sep 20 '22 13:09

Thank you for the quick response @zhiltsov-max !

Indeed, COCO treats class_id == 0 as background / no label; however, we have noticed class_id == 0 being used in other datasets.

We believe Datumaro creates an identity transformation across datasets. In our case we do not have control over the labelling, so we would need to preserve the input labels.

  1. Is class_id == 0 as background / no label a general assumption in the whole library, e.g. PASCAL VOC / tfrecords / etc, or is it just COCO specific?
  2. In case it is not a restriction, would you be amenable to adding a flag to preserve the original class_ids / labels? Since this is on our critical path, we would be more than happy to help out with a PR.

Thank you!

MihailMihaylov97, Sep 20 '22 16:09

Hi @zhiltsov-max @MihailMihaylov97, this issue may explain my previous experience of Datumaro messing up the categories/annotations, which forced me to write a custom converter. Is there any description of this id-changing behavior in the documentation?

I.e., when running datum convert -if coco -i <path/to/ds> -f voc -o <output/dir> or the reverse datum convert -if voc -i <path/to/ds> -f coco -o <output/dir>, I would not have expected any reshuffling to happen, let alone implicitly and without a warning. If I avoid COCO entirely, e.g. datum convert -if cityscapes -i <path/to/ds> -f voc -o <output/dir>, does the 0-based assumption still hold?

23pointsNorth, Sep 25 '22 09:09

@MihailMihaylov97, @23pointsNorth,

we have noticed class_id == 0 being used in other datasets.

Do you mean other COCO-like datasets? I think we could allow choosing the "no label" id as a solution. It could probably be just a switch between '0' and '-1', or some other number. Do you know how such labels are represented in those cases?

We believe Datumaro creates an identity transformation across datasets. In our case we do not have control over the labelling, so we would need to preserve the input labels.

Datumaro is trying its best to make input and output datasets compatible. The result is not exactly an identity, even if the same format is used for reading and writing. The only format in Datumaro that guarantees identity is our own internal datumaro format; all the rest are projected onto it.

Is class_id == 0 as background / no label a general assumption in the whole library, e.g. PASCAL VOC / tfrecords / etc, or is it just COCO specific?

In case it is not a restriction, would you be amenable to adding a flag to preserve the original class_ids / labels? Since this is on our critical path, we would be more than happy to help out with a PR.

If I avoid COCO entirely, e.g. datum convert -if cityscapes -i <path/to/ds> -f voc -o <output/dir>, does the 0-based assumption still hold?

It is COCO-specific. Datumaro works with id 0 the same way as with any other number. However, when it comes to exporting segmentation masks (as images), the general assumption is that the class with color (0,0,0), or the background class, or the existing id 0 is counted as background. This can usually be modified by providing a custom color/label map in formats like VOC and CamVid for export using the --label_map <filename> option. For COCO, I don't see big problems in changing this logic; the implementation is quite straightforward in this regard both for reading and writing. The implementation links are here.

I'm not working on this project anymore (this repository, at least), so I can't comment on whether external PRs are acceptable.

This issue may explain my previous experience of Datumaro messing up the categories/annotations, which forced me to write a custom converter. Is there any description of this id-changing behavior in the documentation?

Yes, this topic is covered in 2 notes in the format description here: https://openvinotoolkit.github.io/datumaro/docs/formats/coco/#import-coco-dataset

There is the --keep-original-ids import option, which preserves the original ids (they are not reindexed with sequential numbers), but this doesn't affect the id 0 problem. Please check whether it could be useful in your cases. It will affect all cases where you export from COCO to something else.

zhiltsov-max, Sep 28 '22 17:09