Conditional DETR's `post_process_semantic_segmentation` removes one class
System Info
transformers 4.57.3 python 3.13.9
Who can help?
@yonigozlan @molbap
Information
- [x] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
Let's run two identically configured models with 100 classes each: first Conditional DETR, then DETR.
import io
import requests
from PIL import Image
import torch
import numpy
from transformers import (
AutoImageProcessor,
ConditionalDetrConfig,
ConditionalDetrForSegmentation,
)
from transformers.image_transforms import rgb_to_id
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")
# randomly initialize all weights of the model
config = ConditionalDetrConfig(num_labels=100)
model = ConditionalDetrForSegmentation(config)
# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")
# forward pass
cond_detr_outputs = model(**inputs)
# Use the `post_process_semantic_segmentation` method of the `image_processor` to retrieve post-processed semantic segmentation maps
cond_detr_result = image_processor.post_process_semantic_segmentation(cond_detr_outputs, target_sizes=[(300, 500)])
import io
import requests
from PIL import Image
import torch
import numpy
from transformers import AutoImageProcessor, DetrForSegmentation, DetrConfig
from transformers.image_transforms import rgb_to_id
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50-panoptic")
config = DetrConfig(num_labels=100)
model = DetrForSegmentation(config)
# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")
# forward pass
detr_outputs = model(**inputs)
# Use the `post_process_semantic_segmentation` method of the `image_processor` to retrieve post-processed semantic segmentation maps
detr_result = image_processor.post_process_semantic_segmentation(detr_outputs, target_sizes=[(300, 500)])
Now let's look at the models' outputs:
>>> cond_detr_outputs.logits.shape
torch.Size([1, 300, 100])
>>> detr_outputs.logits.shape
torch.Size([1, 100, 101])
As we can see, the Conditional DETR logits contain 100 classes, while the DETR logits contain 101 classes (100 + the null class).
However, the `post_process_semantic_segmentation` functions of Conditional DETR and DETR are the same: both treat the logits as if they contained n+1 classes:
class_queries_logits = outputs.logits # [batch_size, num_queries, num_classes+1]
...
# Remove the null class `[..., :-1]`
masks_classes = class_queries_logits.softmax(dim=-1)[..., :-1]
The function removes the last class. This is correct for DETR (which has an additional null class) but incorrect for Conditional DETR (which does not have a null class), where it may discard information about the last actual class.
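To make the effect of the slice concrete without loading either model, here is a minimal sketch that applies the same `softmax(dim=-1)[..., :-1]` expression to random stand-in tensors with the logit shapes reported above. The tensors are placeholders I made up for illustration, not real model output.

```python
import torch

num_labels = 100

# Stand-in logits with the shapes reported above (random values, not model output).
cond_detr_logits = torch.randn(1, 300, num_labels)   # Conditional DETR: no null class
detr_logits = torch.randn(1, 100, num_labels + 1)    # DETR: last column is the null class

# The expression used by both post-processing functions.
cond_detr_classes = cond_detr_logits.softmax(dim=-1)[..., :-1]
detr_classes = detr_logits.softmax(dim=-1)[..., :-1]

print(cond_detr_classes.shape[-1])  # 99: one real class has been dropped
print(detr_classes.shape[-1])       # 100: only the null class has been dropped
```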
Expected behavior
Do not remove the last class in Conditional DETR's post_process_semantic_segmentation function:
masks_classes = class_queries_logits.softmax(dim=-1)
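A toy illustration of why this matters downstream. It assumes that the rest of the post-processing combines class probabilities and mask probabilities with an einsum and then takes an argmax over classes; `semantic_map` and its `drop_last_class` flag are hypothetical names used only for this sketch, not part of the library.

```python
import torch


def semantic_map(class_logits, mask_logits, drop_last_class):
    """Hypothetical stand-in for the post-processing step (an assumption, not library code)."""
    class_probs = class_logits.softmax(dim=-1)
    if drop_last_class:
        # Current behaviour: the last column is treated as a null class and removed.
        class_probs = class_probs[..., :-1]
    mask_probs = mask_logits.sigmoid()
    # Weight each query's mask by its class probabilities, then pick the best class per pixel.
    scores = torch.einsum("bqc,bqhw->bchw", class_probs, mask_probs)
    return scores.argmax(dim=1)


# Two queries, three real classes, 2x2 masks; both queries strongly predict the last class (index 2).
class_logits = torch.tensor([[[0.0, 0.0, 10.0], [0.0, 0.0, 10.0]]])
mask_logits = torch.ones(1, 2, 2, 2)

sliced = semantic_map(class_logits, mask_logits, drop_last_class=True)
full = semantic_map(class_logits, mask_logits, drop_last_class=False)

print(sliced.max().item())  # at most 1: class index 2 can never be predicted
print(full.unique())        # tensor([2]): every pixel gets the last class
```

With the slice in place, the last real class is unreachable no matter how confident the model is about it.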
cc @nielsrogge as well, and note PR at #42681