Confusion around correct bounding box format for DETR training
In the object detection guide there are a few bounding box formats mentioned.
For preprocessing, the guide suggests that the bounding boxes should be in COCO format, as this is what the DETR model expects. There is even a link to the albumentations documentation for the COCO definition, which defines the COCO format as "four values in pixels [x_min, y_min, width, height]. They are coordinates of the top-left corner along with the width and height of the bounding box."
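To make the corner-based convention concrete, here is a tiny illustration (the numbers are made up, not from the guide) of how a COCO box relates to its corner coordinates:

```python
# A COCO-format box: [x_min, y_min, width, height], in absolute pixels.
coco_box = [40.0, 20.0, 100.0, 60.0]

def coco_corners(box):
    """Return (x_min, y_min, x_max, y_max) for a COCO [x, y, w, h] box."""
    x_min, y_min, w, h = box
    return (x_min, y_min, x_min + w, y_min + h)

print(coco_corners(coco_box))  # (40.0, 20.0, 140.0, 80.0)
```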
The guide defines a custom function augment_and_transform_batch to apply augmentations and then format the annotations. The annotation formatting is done with another custom function, format_image_annotations_as_coco, but its docstring states that the bbox argument should be a list of bounding boxes provided in COCO format ([center_x, center_y, width, height] in absolute coordinates). The confusion begins here, since [center_x, center_y, width, height] is actually the YOLO format (except YOLO uses normalized coordinates... more confusion). This function does not actually apply any transformations to the bounding boxes, and, according to the example in this guide, the bbox inputs are actually in COCO format. So I assumed this was a typo and that "center_x, center_y" was supposed to be "x_min, y_min".
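For readers who don't have the guide open: the restructuring that format_image_annotations_as_coco performs looks roughly like the sketch below. This is a reconstruction from the description in this thread, not the guide's exact code, and it only repackages the per-object lists into a COCO-style annotation dict; the bbox values pass through untouched.

```python
def format_image_annotations_as_coco(image_id, categories, areas, bboxes):
    """Pack per-object lists into the COCO-style annotation dict that the
    image processor consumes. The boxes are not transformed here: they stay
    in whatever format they arrive in ([x_min, y_min, w, h] in the guide)."""
    annotations = [
        {"image_id": image_id, "category_id": cat, "area": area,
         "bbox": list(bbox), "iscrowd": 0}
        for cat, area, bbox in zip(categories, areas, bboxes)
    ]
    return {"image_id": image_id, "annotations": annotations}
```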
However, further down in the guide, the compute_metrics function converts the target bounding boxes from YOLO to Pascal VOC. This function expects the bounding boxes in "YOLO format (x_center, y_center, width, height) in range [0, 1]".
So my question is: in what format does the model actually expect the input bounding boxes to be? The guide, as it stands, says that the bounding boxes should be in COCO format (with absolute pixel values), but then during evaluation converts these bounding boxes to Pascal VOC, treating them as if they were in YOLO format (and normalized).
Can you please update the guide to make this more clear?
Thank you for pointing out the confusion in the object detection guide regarding the different bounding box formats. I'll clarify the format expectations and update the guide accordingly:
- Preprocessing and Model Input Format: The DETR model, as mentioned in the guide, expects the input bounding boxes to be in COCO format during preprocessing and model training/inference. The COCO format is defined as [x_min, y_min, width, height], where the coordinates are in absolute pixel values. This is the format that should be used when preparing your data for the DETR model.
- Custom Function format_image_annotations_as_coco: The docstring for this function is incorrect. It should state that the bbox argument should be in COCO format ([x_min, y_min, width, height] in absolute coordinates) instead of YOLO format. This function does not perform any transformations on the bounding boxes and expects the input bboxes to be in COCO format.
- Evaluation and Metric Computation: During evaluation, the bounding boxes should remain in COCO format (absolute pixel values) for consistency. The compute_metrics function should not convert the target bounding boxes from YOLO to Pascal VOC format. The function should expect the bounding boxes in COCO format ([x_min, y_min, width, height]) for proper evaluation.
To summarize, the DETR model in the object detection guide expects the input bounding boxes to be in COCO format ([x_min, y_min, width, height] in absolute pixel coordinates) throughout the pipeline, from preprocessing to evaluation. The guide will be updated to reflect this clarity and ensure consistency in the bounding box format expectations.
I appreciate your feedback,
Hi, thank you for your response! This is very helpful, thank you for clarifying. So we have concluded the following updates should be made:

- the docstring in `format_image_annotations_as_coco` contains a typo and should be updated to state that the bbox argument should be in COCO format ([x_min, y_min, width, height] in absolute coordinates)
- the `convert_bbox_yolo_to_pascal` transformation should be removed for this example, which means the supporting comments should be updated as well, to convey that the target bboxes should be in COCO format (and do not need to be converted to another format)
- additionally, the `box_format` parameter passed to `MeanAveragePrecision` should be `"xywh"` instead of `"xyxy"`

Is that correct?
Yes, that is correct! Here's a summary of the updates that should be made to the object detection guide:
- Update the docstring in the format_image_annotations_as_coco function to state that the bbox argument should be in COCO format ([x_min, y_min, width, height] in absolute coordinates). This is consistent with the expected input format for the DETR model.
- Remove the convert_bbox_yolo_to_pascal transformation from the example code. This transformation is unnecessary since the bounding boxes are already in COCO format. Update the supporting comments to clarify that the target bounding boxes should remain in COCO format and do not need to be converted.
- Change the box_format parameter passed to MeanAveragePrecision to "xywh" instead of "xyxy". This reflects the correct format of the bounding boxes, which is [x_min, y_min, width, height].
These updates will ensure that the guide accurately reflects the expected bounding box formats and removes any confusion or inconsistencies. Thank you for your careful review and feedback!
I am still lost. The DetrImageProcessor formats the annotations to the YOLO format, and the documentation states that the DETR model expects this format.
From the documentation: do_convert_annotations (bool, optional, defaults to True) — Controls whether to convert the annotations to the format expected by the DETR model. Converts the bounding boxes to the format (center_x, center_y, width, height) and in the range [0, 1]. Can be overridden by the do_convert_annotations parameter in the preprocess method.
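In plain arithmetic, the conversion that the quoted do_convert_annotations documentation describes amounts to the following (an illustrative sketch, not transformers code):

```python
def coco_to_normalized_center(box, image_width, image_height):
    """COCO [x_min, y_min, w, h] (absolute pixels) ->
    (center_x, center_y, width, height) normalized to [0, 1],
    mirroring what the quoted do_convert_annotations docs describe."""
    x_min, y_min, w, h = box
    return [
        (x_min + w / 2) / image_width,
        (y_min + h / 2) / image_height,
        w / image_width,
        h / image_height,
    ]

# For a 200x100 image, roughly [0.45, 0.5, 0.5, 0.6] (up to float rounding):
print(coco_to_normalized_center([40, 20, 100, 60], 200, 100))
```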
I apologize for the ongoing confusion regarding the expected bounding box format in the object detection guide. Let's clarify this once and for all:
The DETR model, as the documentation states, does expect the bounding boxes in a specific format: the YOLO format, represented as [center_x, center_y, width, height] with values normalized to the range [0, 1]. This is the format the DetrImageProcessor class produces through the do_convert_annotations parameter, and it is the format the model expects during preprocessing and evaluation. The format_image_annotations_as_coco function is used for formatting annotations for visualization and is not directly related to the model's input format.
I don't think this is true. The format_image_annotations_as_coco function is used to format the annotations into the correct structure for the image processor, as you can see in the augment_and_transform_batch function. However, I'm not as concerned about this function anymore, as there are no transformations being performed on the bounding boxes here. I am really just wondering what format the bounding boxes should be in at each step in the process.
- Augmentations: format should be COCO, since we are specifying `format="coco"`
- Image Processor: input format is COCO, output format is YOLO (which is the format in which the DETR model expects the input bounding boxes)
- DetrForObjectDetection model: input format is YOLO
- DetrForObjectDetection model predictions: it looks like the model outputs the prediction bounding boxes in YOLO, is that correct? This would make sense if it matched the input format
- `image_processor.post_process_object_detection`: it looks like the output of this is in Pascal VOC format. The documentation states that this function "Converts the raw output of DetrForObjectDetection into final bounding boxes in (top_left_x, top_left_y, bottom_right_x, bottom_right_y) format." And we are passing in the YOLO boxes and the original image sizes, and the output boxes contain absolute values. This explains why convert_bbox_yolo_to_pascal is being called to convert the target bboxes, since we need the bboxes in the same format for evaluation.
- Evaluation: shouldn't the target bboxes and prediction bboxes be in the same format? In this example, if I'm understanding correctly, both are in Pascal VOC format. Therefore, we do actually need to pass `"xyxy"` and not `"xywh"` like we thought.
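To make the post-processing step concrete, the YOLO-to-Pascal-VOC conversion discussed above boils down to the arithmetic below (a sketch of what the guide's convert_bbox_yolo_to_pascal helper does, as I understand it; not the guide's exact code):

```python
def yolo_to_pascal_voc(box, image_width, image_height):
    """(center_x, center_y, w, h) in [0, 1] ->
    absolute (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return [
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    ]

# For a 200x100 image, roughly [40.0, 20.0, 140.0, 80.0] (up to float rounding):
print(yolo_to_pascal_voc([0.45, 0.5, 0.5, 0.6], 200, 100))
```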
TLDR, the guide is actually right aside from the docstring typo that we pointed out. It just wasn't clear to me that the image processor was converting boxes from COCO to YOLO, and that the image_processor.post_process_object_detection was converting the boxes from YOLO to Pascal VOC. The use of three different bounding box formats is confusing and seems unnecessary.
Can we get some more attention on this issue? From the example and documentation, this simply is not explained clearly enough. The fact that literally three different formats are being used in one example seems absolutely unnecessary, as @tkh5044 pointed out. Furthermore, I seem to be getting negative bounding box coordinates from the post_process_object_detection method, even though I am certain that I chose the correct input format for the ImageProcessor (COCO format).
Hi all, thanks for discussing it!
At the moment it indeed looks complicated, with three different box formats used, and I have plans to simplify annotations / bounding boxes for object detection. If you have any suggestions on what the interface should look like, you are welcome 🤗
I find it most confusing that the DETR implementation requires labels in the COCO format but with bounding boxes in the YOLO format. Fixing this would help but is obviously beyond fixing the tutorial.
I would love to see any dataset labels converted into YOLO (if that is what DETR wants) and then kept that way. For visualization purposes, maybe add a yolo_coco converter.
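Such a converter would only need a few lines; something along these lines (an illustrative sketch, under the assumption that DETR-style boxes are normalized (center_x, center_y, w, h)):

```python
def yolo_to_coco(box, image_width, image_height):
    """Normalized (center_x, center_y, w, h) ->
    COCO [x_min, y_min, w, h] in absolute pixels, e.g. for visualization."""
    cx, cy, w, h = box
    abs_w = w * image_width
    abs_h = h * image_height
    return [cx * image_width - abs_w / 2,
            cy * image_height - abs_h / 2,
            abs_w, abs_h]
```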
Setting do_convert_annotations=False would make it easier, because then the output labels would follow the input labels.
@daniel-bogdoll Thanks, sounds good to me! In case you have the capacity you can make a PR, contributions are very welcome 🤗 Anyway thanks a lot for your suggestions!
Probably I won't, but I just found out that there is one more place where the bounding boxes are being changed. I feel like this only adds to the confusion:

post_process_object_detection: "Converts the raw output of DetrForObjectDetection into final bounding boxes in (top_left_x, top_left_y, bottom_right_x, bottom_right_y) format."
> Probably I won't, but I just found out that there is one more place where the bounding boxes are being changed. I feel like this only adds to the confusion:
>
> post_process_object_detection: "Converts the raw output of DetrForObjectDetection into final bounding boxes in (top_left_x, top_left_y, bottom_right_x, bottom_right_y) format."
Didn't you already point out this place in your earlier step-by-step comment? That call is used to convert the predictions into Pascal VOC format before converting the target boxes into Pascal VOC too, and then comparing to get the mAP score.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi all, the documentation still does not specify the expected format of labels['boxes'] when passing labels to .forward(). Could someone specify the expected format and normalization of the boxes (e.g. normalized center_x, center_y, w, h, all in [0, 1]), and could it be added to the documentation?
Sorry if I'm missing it elsewhere. Thank you