Question about training for semantic segmentation.
Thanks for sharing your great work; I have read your paper.
I have one question about training for semantic segmentation on ADE20K. For instance segmentation and panoptic segmentation, we can get GT bboxes from the masks of thing classes. But for semantic segmentation, masks of the same class are grouped together, so we cannot use instance-level masks to get GT bboxes. How did you train Mask DINO for semantic segmentation? Did you treat multiple masks of the same class as a single instance? Or did you consider all classes as stuff classes and remove the box loss and box matching?
Thank you for your reply.
You mean that you don't use bboxes in semantic segmentation (ADE20K) training, right?
Sorry, I misunderstood your question. We do not use instance-level annotations of things for semantic segmentation. We do use bboxes in semantic segmentation, generated from all masks of the same category. Please refer to Mask2Former for more training details, as we use the same setting as that paper. In addition, we still calculate the box loss for things in semantic segmentation. Thank you.
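For reference, here is a minimal sketch of how such per-category boxes can be generated from a semantic map. The helper name and the raw `(x1, y1, x2, y2)` pixel convention are illustrative assumptions, not the exact code in our repository (in practice the boxes would be converted to whatever normalized format the box loss expects):

```python
import torch

def boxes_from_semantic_mask(sem_seg: torch.Tensor) -> dict:
    """Derive one GT box per category present in a semantic map.

    sem_seg: (H, W) integer tensor of per-pixel class indices.
    Returns {class_id: (x1, y1, x2, y2)} in pixel coordinates.
    Hypothetical helper for illustration, not code from the Mask DINO repo.
    """
    boxes = {}
    for cls in sem_seg.unique().tolist():
        # Coordinates of every pixel labeled with this category.
        ys, xs = torch.nonzero(sem_seg == cls, as_tuple=True)
        # A single box enclosing all pixels of the category, even when
        # they form several disconnected regions of the image.
        boxes[cls] = (xs.min().item(), ys.min().item(),
                      xs.max().item(), ys.max().item())
    return boxes
```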
I understand.
If there are two people, one in the upper right and one in the lower left of the image, does that mean there is a single large bbox for the person class?
Yes, your understanding is correct.
Thank you for your quick and polite response.
I look forward to the code and model being released!