
✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation

Open yaswanth19 opened this issue 8 months ago • 26 comments

What does this PR do?

Fixes #37171 and is a continuation of #37392.

yaswanth19 avatar Apr 18 '25 11:04 yaswanth19

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

github-actions[bot] avatar Apr 18 '25 11:04 github-actions[bot]

Image segmentation model so cc @qubvel @nielsrogge!

Rocketknight1 avatar Apr 22 '25 12:04 Rocketknight1

@qubvel A rough draft is ready for inference :hugs:. Now I am adding support for training, and the original implementation uses mask annealing to determine the probability for attention masks. Is it required in this HF implementation as well (I don't see it for any other model), or are we fine with having a fixed probability for the attention mask?

yaswanth19 avatar Apr 27 '25 19:04 yaswanth19

Hey @yaswanth19, do you mean the mask probability changes during model training? It would be nice to have, but I'm not sure it can be easily implemented, tbh. Maybe we can just add a Trainer callback to change it, similar to a learning-rate callback? (I mean not adding it to Transformers itself, but to the docs/fine-tuning guide.)

qubvel avatar Apr 28 '25 08:04 qubvel

do you mean the mask probability changes during the model training?

Yup, exactly, and adding a Trainer callback seems like a good idea. I will check the feasibility of the implementation; if it's simple enough we can implement it in the model itself, otherwise pivot to a Trainer callback.
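For reference, a minimal sketch of what such a callback could look like (assuming the model exposes an attn_mask_probability attribute and a simple linear schedule - the names here are hypothetical, not a final API):

```python
from transformers import TrainerCallback


class MaskAnnealingCallback(TrainerCallback):
    """Linearly anneal the attention-mask probability over the course of training."""

    def __init__(self, start_prob: float = 1.0, end_prob: float = 0.0):
        self.start_prob = start_prob
        self.end_prob = end_prob

    def on_step_begin(self, args, state, control, model=None, **kwargs):
        # Fraction of training completed so far.
        progress = state.global_step / max(state.max_steps, 1)
        prob = self.start_prob + (self.end_prob - self.start_prob) * progress
        # Hypothetical attribute the model would read when deciding whether to
        # apply masked attention in a given block.
        model.attn_mask_probability = prob
```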

yaswanth19 avatar Apr 28 '25 08:04 yaswanth19

@yaswanth19 Thanks for your great work!

A few thoughts:
• Mask annealing: Setting it to a fixed probability of 1 requires masked attention during inference, which we want to avoid. A fixed probability of 0 removes masked attention but does hurt performance. If mask annealing isn't feasible right now, I'd suggest setting it to 0 for now. If you do implement it, note that intermediate mask predictions will be required, so the _predict method might need to move into EoMTEncoder. Just make sure intermediate masks are skipped at inference to avoid slowing things down.
• Testing: It might be good to add an integration test that verifies both the mask and class outputs for semantic, instance, and panoptic segmentation.
• Weight initialization: The current approach likely misinitializes parameters like query embeddings (e.g. std=0.02). All parameters outside the ViT should follow PyTorch defaults (e.g. nn.Embedding uses std=1.0).
• ViT backbone: Would it be simpler to use timm for the ViT? This allows loading pre-trained weights when training EoMT from scratch, avoids unnecessary code, and avoids unsupported use cases like training from a randomly initialized ViT.
• Loss function: Could we reuse the Mask2Former loss? Unless it conflicts with Transformers guidelines, this might reduce redundancy.

Let me know your thoughts.

tommiekerssies avatar Apr 30 '25 12:04 tommiekerssies

Thanks @tommiekerssies for your initial thoughts.

Mask annealing: Setting it to a fixed probability of 1 requires masked attention during inference, which we want to avoid.

  • I am not sure about implementing mask annealing natively in Transformers. As I said above, in the worst case we can set it to 0 or use a Trainer callback if that's feasible.

Testing: It might be good to add an integration test

Yup, will add the complete test suite once I have an implementation ready.

Weight initialization: The current approach likely misinitializes parameters like query embeddings

Thanks for bringing this to my attention; I can correct the initialization later on, once we have the end-to-end code ready. IMO most users don't initialize from scratch and will either fine-tune the model or just run inference. Having said that, I will look at the timm implementation and initialize in the same way.

ViT backbone: Would it be simpler to use timm for the ViT?

Ideally yes :sweat_smile: But that's not the library's coding standard (we don't want to introduce a hard dependency on timm). Also, using a timm backbone directly would not be compatible with all the other features the HF ecosystem provides, IMO. I am actually referring to both the HF ViT and the timm implementation to get the best of both worlds and to avoid introducing any bugs.

Loss function: Could we reuse the Mask2Former loss?

Transformers has a one-model-one-file philosophy, and because of that I have copied the Mask2Former loss completely here. It's a subjective call with the modular file, in the sense that we could expose the Mask2Former loss and import it here for EoMT (this would require additional changes in mask2former), but that can be discussed with the core maintainers during review.
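If we do end up going the modular route, the idea would be roughly the following (a sketch only, assuming Mask2FormerLoss can be imported from modeling_mask2former; not the final design):

```python
# modular_eomt.py (illustrative sketch)
from transformers.models.mask2former.modeling_mask2former import Mask2FormerLoss


class EomtLoss(Mask2FormerLoss):
    # Reuse the Hungarian matching and mask/class losses unchanged; the modular
    # converter would then copy this code into the generated modeling_eomt.py,
    # keeping the one-model-one-file layout intact.
    pass
```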

yaswanth19 avatar Apr 30 '25 12:04 yaswanth19

Thanks for the clarifications!

Regarding mask annealing, I agree that 0 for now is fine. That means effectively disabling masked attention and mask annealing, which is what the current code already does, so no changes needed on that front.

For weight initialization and the ViT backbone, I understand the constraints around using timm. In that case, I’d just make sure that the non-ViT parameters (query embeddings, mask MLP, upscale blocks, class head) aren’t using any custom initializations and instead follow PyTorch defaults. Should be a quick fix.
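Concretely, something along these lines in _init_weights should be enough (just a sketch; the _is_vit_layer marker is a hypothetical way of telling backbone modules apart):

```python
import torch.nn as nn


def _init_weights(self, module):
    # Only the ViT backbone keeps a custom truncated-normal init; everything else
    # (query nn.Embedding, mask MLP, upscale blocks, class head) is left untouched,
    # i.e. at the PyTorch constructor defaults (nn.Embedding ~ N(0, 1)).
    if getattr(module, "_is_vit_layer", False):  # hypothetical marker on backbone modules
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            nn.init.trunc_normal_(module.weight, std=self.config.initializer_range)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
```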

Let me know if you’d like me to look at any part in more detail.

tommiekerssies avatar Apr 30 '25 13:04 tommiekerssies

Hi @qubvel, I'm working on refactoring the training logic for EoMT, and I'm running into a design challenge:

In the original single‐class implementation, they call _predict (which uses the class predictor head) on intermediate layer outputs to build attention masks. Because everything lives in one class, this is straightforward.

Refer: https://github.com/tue-mps/eomt/blob/c311b377d3189c976163e4ceb2156d90bb7db88f/models/eomt.py#L130

In our modular HF version, the encoder (EoMTEncoder) only runs the transformer blocks, and _predict (with mask_head, upscale_block, and class_predictor) lives in EoMTModel or EoMTForUniversalSegmentation. That separation means the encoder loop can't access _predict, so we can't reconstruct the original training flow.

I have two solutions in mind; LMK your thoughts on the approaches below, and feel free to suggest a better alternative:

1.) Merge everything from EoMTEncoder into EomtForUniversalSegmentation, so that all the processing happens in a single class's forward (see the sketch below).

2.) Move _predict (which includes mask_head and upscale_block) into EoMTEncoder and somehow pass in the class_head. Passing these modules into the encoder class would let its forward loop call _predict, build the attention mask, and feed it into the next block. IMO this is a bit dirty and the flow gets tangled 😅.

Here the _predict function is the same code used in Mask2Former to get mask_logits and class_logits from the model output.
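To make option (1) concrete, the forward loop would look roughly like this (a method-body sketch only; num_query_blocks and _build_attention_mask are placeholder names, not existing code):

```python
def forward(self, pixel_values):
    # Sketch of option (1): the block loop lives directly in EomtForUniversalSegmentation.
    hidden_states = self.embeddings(pixel_values)
    attention_mask = None
    for idx, block in enumerate(self.layers):
        if self.training and idx >= len(self.layers) - self.num_query_blocks:
            # Intermediate prediction on the current hidden states, mirroring what the
            # original single-class implementation does inside its block loop.
            mask_logits, class_logits = self._predict(hidden_states)
            attention_mask = self._build_attention_mask(mask_logits)
        hidden_states = block(hidden_states, attention_mask=attention_mask)
    return self._predict(hidden_states)
```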

yaswanth19 avatar May 01 '25 07:05 yaswanth19

@qubvel A ping for your comments on the above issue/challenge so that I can refactor it this weekend 😅

yaswanth19 avatar May 02 '25 11:05 yaswanth19

@yaswanth19 can you send me your email address so we can set up a Slack channel to discuss the integration, rather than having to discuss everything on the Github thread?

NielsRogge avatar May 02 '25 12:05 NielsRogge

Yup sure @NielsRogge , Here is my email id: [email protected]

yaswanth19 avatar May 02 '25 12:05 yaswanth19

Nice work @yaswanth19 ! For fixing the initialisation, maybe you could have a look here, as the Mask2Former implementation in HuggingFace suffers from the same bug: https://github.com/huggingface/transformers/issues/35877#issuecomment-2749038225

tommiekerssies avatar May 11 '25 17:05 tommiekerssies

Hi @tommiekerssies, sorry for the delay - I was occupied with other personal work - I will fix the initialization part 🤗. The model is ready and the logits match as of now, but the code still needs some refactoring. The image processing is also ready, but right now it doesn't support targets (masks and labels), and I have a few questions about that.

Can you explain what CLASS_MAPPING and INSTANCE_MAPPING are in each xx_dataset.py file? Also, it would be great if you could explain how the mask and class labels are extracted, specifically the logic of the target_parser function. It's quite different for each dataset class, and I am having a hard time building an intuition for a common/general logic to implement.

Another question: we seem to pad the image twice - once in the transforms and once again in the resize_and_pad_imgs_instance_panoptic function. Why is that 🤔

yaswanth19 avatar May 18 '25 10:05 yaswanth19

Hi @yaswanth19, great work!

CLASS_MAPPING is used to remap the dataset’s class IDs to a contiguous range without gaps. INSTANCE_MAPPING is specific to ADE20K panoptic, which requires merging semantic and instance labels: we apply CLASS_MAPPING to the semantic labels (skipping “thing” classes) and INSTANCE_MAPPING to the instance labels (which include “thing” classes).
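In other words, CLASS_MAPPING is just a dict from the original IDs to contiguous ones, applied to the label map, e.g. (illustrative values only):

```python
import numpy as np

# Illustrative only: remap sparse dataset class IDs to a contiguous 0..N-1 range.
CLASS_MAPPING = {1: 0, 3: 1, 7: 2}  # original_id -> contiguous_id (values made up)


def remap(semantic_map: np.ndarray) -> np.ndarray:
    remapped = np.full_like(semantic_map, 255)  # 255 used as the ignore index here
    for original_id, contiguous_id in CLASS_MAPPING.items():
        remapped[semantic_map == original_id] = contiguous_id
    return remapped
```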

The target_parser function is dataset-specific. It converts labels from the dataset’s original format into the consistent (masks, labels) format expected by our training/evaluation code. For example, COCO instance annotations use RLE in JSON, while COCO panoptic labels are stored as PNGs. There are also quirks like COCO’s is_crowd flag, which needs special handling (ignored during training, but used for proper AP/PQ evaluation).

Due to these differences, it might not be ideal to include dataset-specific parsing logic in the HF Transformers codebase. Instead, the preprocessor can expect targets to already be in (masks, labels) format, leaving the conversion to users. What do you think?
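For example, for COCO-style instance annotations stored as compressed RLE, the user-side conversion could be as simple as the following (a sketch with pycocotools; polygon segmentations would first need mask_utils.frPyObjects):

```python
import numpy as np
from pycocotools import mask as mask_utils


def coco_annotations_to_targets(annotations):
    """Turn a list of COCO-style annotations (RLE 'segmentation' + 'category_id')
    into the (masks, labels) pair the preprocessor would expect."""
    masks = np.stack([mask_utils.decode(ann["segmentation"]) for ann in annotations])
    labels = np.array([ann["category_id"] for ann in annotations])
    return masks, labels
```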

Regarding padding: it’s not applied twice. The transform pipeline is used during training, where we apply random scale jittering, pad to square, and then crop. The resize_and_pad_imgs_instance_panoptic function is used only during evaluation, where we resize the long side to the input size and pad the short side to make the image square.
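For reference, the evaluation-time path boils down to something like this (a sketch with torchvision; the exact interpolation and padding value are assumptions):

```python
import torch
import torchvision.transforms.functional as F


def resize_and_pad(image: torch.Tensor, size: int = 640) -> torch.Tensor:
    """Resize the long side to `size`, then pad the short side so the image is square."""
    _, height, width = image.shape
    scale = size / max(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    image = F.resize(image, [new_h, new_w])
    # Pad on the right/bottom up to size x size.
    return F.pad(image, [0, 0, size - new_w, size - new_h])
```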

Let me know if anything’s unclear!

tommiekerssies avatar May 19 '25 13:05 tommiekerssies

Hi @yaswanth19 ! Is this ready for a review? Feel free to ping me when it is!

yonigozlan avatar May 20 '25 21:05 yonigozlan

@qubvel @NielsRogge The PR is ready for review. I have added some TODOs for myself, which are independent items and not a high priority. I will work on them in parallel with the reviews once I find some more time.

@tommiekerssies Please have a look if you have some bandwidth. IMO, focus mostly on the processing class, because that's the module which differs from the original implementation. AFAIK I have standardized the inference pre-processing correctly; LMK if you find any inconsistencies w.r.t. pre- and post-processing.

yaswanth19 avatar May 25 '25 13:05 yaswanth19

Dear @yaswanth19, great work! I'm currently on holiday and will review on Tuesday June 3rd when I'm back at work.

tommiekerssies avatar May 26 '25 15:05 tommiekerssies

The diff between the modular and modeling files can be fixed once PR #38503 is merged, which will make the CI happy. Fixed the previously failing tests - all good to go 🚀 @yonigozlan

yaswanth19 avatar Jun 01 '25 10:06 yaswanth19

Hi @yaswanth19, I haven't finished my full review yet (for which I'd also like to test the code myself). However, one thing that might be useful to mention already is that the current code doesn't seem to support DINOv2 giant, as the giant model uses a SwiGLU FFN. It should be relatively easy to support, though.
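For context, the SwiGLU FFN used by the giant variant looks roughly like this (a sketch; the actual DINOv2 weights may split or scale the hidden dimension slightly differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: out = W3(silu(W1 x) * W2 x)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # Single projection producing both the gate and the value branch.
        self.w12 = nn.Linear(dim, 2 * hidden_dim)
        self.w3 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.w12(x).chunk(2, dim=-1)
        return self.w3(F.silu(gate) * value)
```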

tommiekerssies avatar Jun 04 '25 11:06 tommiekerssies

Additionally, I think the current weight init code will still cause nn.Embedding modules (such as the queries) to be initialized with std 0.02, whereas they should use the default PyTorch init with std 1.0.

tommiekerssies avatar Jun 04 '25 11:06 tommiekerssies

support DINOv2 giant, as the giant model uses a SwiGLU FFN

Ohh, I used large variants for testing, hence missed this one. Will make changes to support it, which should be easy.

yaswanth19 avatar Jun 04 '25 12:06 yaswanth19

Thanks @tommiekerssies for the review. I've added my replies; my only remaining concerns are the post-processor for instance segmentation and attn_mask_prob.

More details: https://github.com/huggingface/transformers/pull/37610#discussion_r2129262453

yaswanth19 avatar Jun 05 '25 16:06 yaswanth19

@yonigozlan The PR is ready for review.

yaswanth19 avatar Jun 11 '25 17:06 yaswanth19

@yonigozlan Addressed the last comments 🤗 and should be ready for @Cyrilvallez review.

I ran test_batching_equivalence multiple times locally, but it did not fail for me 🤔, and it doesn't fail in CI either; seems weird considering the different hardware. Also, I updated the Hub checkpoints with the fast image processor.

yaswanth19 avatar Jun 13 '25 05:06 yaswanth19

Also, @yonigozlan, can you trigger the CI to run all the slow tests and check whether any other tests are failing? I don't have the capacity or the environment to run all the slow tests 😅

yaswanth19 avatar Jun 13 '25 05:06 yaswanth19

@Cyrilvallez @yonigozlan The model naming now follows camel casing, and I've replied to the query. I still need to update the checkpoints - I'm waiting for your approval so that I can convert all the checkpoints, and then @tommiekerssies can transfer them to the tue-mps org 🚀 🤗

yaswanth19 avatar Jun 18 '25 16:06 yaswanth19

@Cyrilvallez The image processor is quite different from anything we have at the moment, as maskformer and mask2former don't have a fast image processor yet. However, some modular refactoring should be possible once we get to those! But for now we wouldn't gain anything with modular, so I'm happy to merge as is.

yonigozlan avatar Jun 18 '25 17:06 yonigozlan

@Cyrilvallez A gentle ping for your review/approval so we can complete the final steps of checkpoints transfer and get this merged soon 🤗 🚀

yaswanth19 avatar Jun 25 '25 07:06 yaswanth19

Hi @tommiekerssies, can you please add me to the tue-mps org so I can transfer the checkpoints from my account to the org account? The checkpoints are:

  • yaswanthgali/ade20k_panoptic_eomt_large_640
  • yaswanthgali/coco_panoptic_eomt_large_640-hf
  • yaswanthgali/ade20k_semantic_eomt_large_512-hf
  • yaswanthgali/coco_instance_eomt_large_640-hf

I've also added a draft of the README - please feel free to review and suggest edits (you can edit it yourself once the checkpoints are in the org). Once the checkpoints are transferred, I'll update the model_id references accordingly.

Do you want to keep these new checkpoints alongside the existing ones (in which case the new ones can be suffixed with "-hf", like tue-mps/coco_panoptic_eomt_large_640-hf), or do you want to replace the existing ones?

Additionally, there are a few more checkpoints that still need to be converted to the HF format. If it works for you, it would be great if you could upload those directly under the tue-mps org.

yaswanth19 avatar Jun 26 '25 09:06 yaswanth19