Conditional DETR's `post_process_semantic_segmentation` removes one class

Open simonreise opened this issue 3 weeks ago • 1 comment

System Info

transformers 4.57.3 python 3.13.9

Who can help?

@yonigozlan @molbap

Information

[x] The official example scripts
[x] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[x] My own task or dataset (give details below)

Reproduction

Let's run two comparable models, DETR and Conditional DETR, each configured with 100 classes:

import io
import requests
from PIL import Image
import torch
import numpy

from transformers import (
    AutoImageProcessor,
    ConditionalDetrConfig,
    ConditionalDetrForSegmentation,
)
from transformers.image_transforms import rgb_to_id

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")

# randomly initialize all weights of the model
config = ConditionalDetrConfig(num_labels=100)
model = ConditionalDetrForSegmentation(config)

# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")

# forward pass
cond_detr_outputs = model(**inputs)

# Use the `post_process_semantic_segmentation` method of the `image_processor` to retrieve post-processed semantic segmentation maps
cond_detr_result = image_processor.post_process_semantic_segmentation(cond_detr_outputs, target_sizes=[(300, 500)])

# Second script: the same setup with DETR
import io
import requests
from PIL import Image
import torch
import numpy

from transformers import AutoImageProcessor, DetrForSegmentation, DetrConfig
from transformers.image_transforms import rgb_to_id

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50-panoptic")

config = DetrConfig(num_labels=100)
model = DetrForSegmentation(config)

# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")

# forward pass
detr_outputs = model(**inputs)

# Use the `post_process_semantic_segmentation` method of the `image_processor` to retrieve post-processed semantic segmentation maps
detr_result = image_processor.post_process_semantic_segmentation(detr_outputs, target_sizes=[(300, 500)])

Now let's look at the models' outputs:

>>> cond_detr_outputs.logits.shape
torch.Size([1, 300, 100])

>>> detr_outputs.logits.shape
torch.Size([1, 100, 101])

As we can see, the Conditional DETR logits contain 100 classes, while the DETR logits contain 101 classes (100 + the null class).

However, the post_process_semantic_segmentation functions of Conditional DETR and DETR are identical: both treat the logits as if they contained n+1 classes:

class_queries_logits = outputs.logits  # [batch_size, num_queries, num_classes+1]
...
# Remove the null class `[..., :-1]`
masks_classes = class_queries_logits.softmax(dim=-1)[..., :-1]

The function removes the last class. That is correct for DETR, which has an additional null class, but incorrect for Conditional DETR, which has no null class: there, the slice discards the last real class, so that class can never appear in the segmentation map.
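
A minimal sketch of the effect, using a synthetic tensor in place of real model outputs. The shape matches the Conditional DETR logits above, but the values are made up for illustration, and only `torch` is assumed:

```python
import torch

num_queries, num_classes = 300, 100

# Synthetic stand-in for Conditional DETR logits: there is no null class,
# so the last dimension is exactly num_classes.
logits = torch.zeros(1, num_queries, num_classes)
logits[..., num_classes - 1] = 10.0  # make the last real class the clear winner

probs = logits.softmax(dim=-1)
sliced = probs[..., :-1]  # the slice applied by the current post-processing

full_pred = probs.argmax(dim=-1)     # class index per query, all classes kept
sliced_pred = sliced.argmax(dim=-1)  # class index per query, last class dropped

print(probs.shape, sliced.shape)
print(full_pred.unique(), sliced_pred.unique())
```

Every query in `full_pred` picks class 99, while `sliced_pred` cannot contain 99 at all because that column no longer exists after the slice.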

Expected behavior

Do not remove the last class in Conditional DETR's post_process_semantic_segmentation function:

masks_classes = class_queries_logits.softmax(dim=-1)
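
To illustrate the difference end to end, here is a rough standalone sketch. `semantic_map` and its `drop_null_class` flag are hypothetical names, not part of transformers, and the einsum step that combines class and mask probabilities is an assumption about how the rest of the function uses `masks_classes`:

```python
import torch


def semantic_map(class_logits, mask_logits, drop_null_class):
    """Hypothetical helper, not the library implementation.

    class_logits: [batch, queries, classes (+1 if a null class exists)]
    mask_logits:  [batch, queries, height, width]
    Returns a [batch, height, width] tensor of class indices.
    """
    masks_classes = class_logits.softmax(dim=-1)
    if drop_null_class:
        # Appropriate only when the last column really is a null class (DETR).
        masks_classes = masks_classes[..., :-1]
    masks_probs = mask_logits.sigmoid()
    # Assumed combination step: weight each query's mask by its class scores.
    scores = torch.einsum("bqc,bqhw->bchw", masks_classes, masks_probs)
    return scores.argmax(dim=1)


# Two synthetic queries over a 4x4 image and 100 real classes (no null class).
class_logits = torch.zeros(1, 2, 100)
class_logits[0, 0, 99] = 10.0  # query 0 predicts the last real class
class_logits[0, 1, 3] = 10.0   # query 1 predicts class 3

mask_logits = torch.full((1, 2, 4, 4), -10.0)
mask_logits[0, 0, :, :2] = 10.0  # query 0 covers the left half
mask_logits[0, 1, :, 2:] = 10.0  # query 1 covers the right half

kept = semantic_map(class_logits, mask_logits, drop_null_class=False)
dropped = semantic_map(class_logits, mask_logits, drop_null_class=True)
print(kept[0])
print(dropped[0])
```

With the slice disabled, the left half of the map is labeled 99. With the slice enabled, class 99 cannot be produced, so those pixels are assigned to some other class.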

Dec 07 '25 01:12 simonreise

cc @nielsrogge as well, and note PR at #42681

Dec 08 '25 14:12 Rocketknight1