
[WIP] Add DINO DETR Model to HuggingFace Transformers

Open konstantinos-p opened this issue 9 months ago • 31 comments

What does this PR do?

This PR introduces the DINO DETR model (DINO: DETR with Improved deNoising anchOr boxes, https://arxiv.org/abs/2203.03605) to the Hugging Face Transformers library. DINO DETR is a state-of-the-art object detection model that builds upon the original DETR architecture, incorporating improvements such as:

  • Contrastive denoising training to enhance object queries.
  • Mixed query selection for better initialization of the decoder queries.
  • Look forward twice (LFT) mechanism to refine object boxes.
  • Denoising-based training that stabilizes the bipartite matching loss.

The model achieves strong performance on COCO test-dev (https://paperswithcode.com/sota/object-detection-on-coco).

Fixes #36205

What's included

  • Implementation of DinoDetrModel and DinoDetrForObjectDetection
  • Implementation of DinoDetrImageProcessor

Resources I've used

  • I've based this implementation on the original repo https://github.com/IDEA-Research/DINO/tree/main, and I am testing with the "checkpoint0011_4scale.pth" checkpoint, which can be found at https://drive.google.com/drive/folders/1qD5m1NmK0kjE5hh-G17XUX751WsEG-h_?usp=sharing
  • The original repository contains custom CUDA kernels for the deformable attention. In their place, I've used a copy of the deformable attention implementation from the Deformable DETR model in Transformers: https://github.com/huggingface/transformers/blob/main/src/transformers/models/deformable_detr/modeling_deformable_detr.py.

Who can review?

@amyeroberts, @qubvel

konstantinos-p avatar Mar 14 '25 06:03 konstantinos-p

Hello @qubvel! The state of the PR is that I've gotten the forward pass to match the original implementation up to the required precision. I've marked this as a draft because I wanted to get a first opinion on whether I'm modifying the correct files in the codebase. Let me know if I'm missing anything big. Regarding tests, I've copied some from Deformable DETR but haven't tried getting them to work. Let me know if I need to add any apart from what's already there.

konstantinos-p avatar Mar 14 '25 06:03 konstantinos-p

Hi @konstantinos-p! Thanks a lot for working on the model, super excited to see it merged! 🚀

Before diving into the implementation details, here are some general comments we need to address in the PR:

  • [ ] Integration Tests
    Let's set up integration tests to ensure output consistency is maintained while refactoring the modeling code. Please use torch.testing.assert_close when comparing tensors, not self.assertTrue(torch.allclose(...)). A short, self-contained sketch of this assertion style follows this checklist.

  • [ ] Modeling Code
    It would be super nice to use the modular approach! It's the way to use inheritance in transformers while keeping the one-model-one-file format; please see more information here: https://huggingface.co/docs/transformers/main/en/modular_transformers and in the examples/modular-transformers folder of the repo. The Siglip2 and RT-DETRv2 models were also added using modular (a minimal modular sketch follows this checklist).

  • [ ] Consistent Naming for Classes & Modules

    • Model names and types: DinoDetr and dino_detr
    • All module names should be prefixed with the model name. For example, instead of ResidualBlock, use DinoDetrResidualBlock.
  • [ ] Code Paths & Cleanup

    • No asserts in the code and a minimum of raise statements; config params should be validated in the config.

    • Remove unused conditional statements if no existing configuration requires them. Also, remove the unnecessary config parameters.

      # Instead of:
      if config.use_last_layer_norm:
          self.norm = nn.LayerNorm()
      else:
          self.norm = nn.Identity()
      
      # If `use_last_layer_norm = True` in all configs, simplify it to:  
      self.norm = nn.LayerNorm()
      
  • [ ] Code Style

    • Use clear, descriptive variable names: avoid single-letter (x) or placeholder (tmp) variables.
    • Add comments for non-obvious code.
    • Use type hints for modules (e.g. in __init__ and forward).
  • [ ] Conversion Script Standard
    Please follow the mllama model format for the conversion script. Define a single key mapping using regex (see the rt_detr_v2, ijepa, mllama, and superglue models for reference). This is our new standard for all newly added models; a sketch of the key-mapping style follows this checklist.
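
For the integration-test item above, here is a minimal, self-contained sketch of the preferred assertion style (the tensor values are placeholders, not real DINO DETR reference outputs):

import torch

# Compare (a slice of) the model output against hard-coded reference values.
# In a real integration test, `outputs` would be something like model(**inputs).logits[0, :1, :3]
# and `expected` the values captured from the original implementation.
outputs = torch.tensor([[-4.5501, -4.1699, -4.3102]])
expected = torch.tensor([[-4.5500, -4.1700, -4.3100]])

# Preferred over self.assertTrue(torch.allclose(...)): on failure this reports
# the mismatching elements and tolerances instead of a bare assertion error.
torch.testing.assert_close(outputs, expected, rtol=1e-3, atol=1e-3)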
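
For the modular item, a rough sketch of what the top of modular_dino_detr.py could look like; inheriting from Deformable DETR here is only an assumption for illustration, and modeling_dino_detr.py would then be auto-generated by the modular converter:

from transformers.models.deformable_detr.modeling_deformable_detr import (
    DeformableDetrEncoderLayer,
    DeformableDetrPreTrainedModel,
)


class DinoDetrPreTrainedModel(DeformableDetrPreTrainedModel):
    pass


class DinoDetrEncoderLayer(DeformableDetrEncoderLayer):
    # Override only what differs from Deformable DETR; everything else is inherited.
    pass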
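
And for the conversion-script item, a sketch of the regex key-mapping style; the patterns below are hypothetical and do not reflect the actual DINO DETR checkpoint layout:

import re

# Hypothetical original-checkpoint -> transformers key patterns (illustration only).
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
    r"^backbone\.0\.body\.(.*)": r"model.backbone.conv_encoder.model.\1",
    r"^transformer\.encoder\.(.*)": r"model.encoder.\1",
    r"^transformer\.decoder\.(.*)": r"model.decoder.\1",
}

def convert_old_keys_to_new_keys(state_dict_keys):
    # Map each original key to its converted name using the regex table above.
    new_keys = {}
    for key in state_dict_keys:
        new_key = key
        for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
            new_key = re.sub(pattern, replacement, new_key)
        new_keys[key] = new_key
    return new_keys

# Example:
# convert_old_keys_to_new_keys(["transformer.encoder.layers.0.self_attn.out_proj.weight"])
# -> {"transformer.encoder.layers.0.self_attn.out_proj.weight": "model.encoder.layers.0.self_attn.out_proj.weight"}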

Thanks again! Looking forward to the updates 🚀

qubvel avatar Mar 14 '25 13:03 qubvel

Thanks for the comments! I'll start addressing them!

konstantinos-p avatar Mar 17 '25 07:03 konstantinos-p

Hello @qubvel! I'm facing a similar issue to this one. Basically, I'm rewriting the model to use modular, but instead of DinoDetr, Dino gets captured by the compiler as the model name. Is there a fix for this? Alternatively, I could apply modular only to the processor. @Cyrilvallez was mentioned as possibly having faced something similar.

konstantinos-p avatar Apr 10 '25 15:04 konstantinos-p

Yes, see my answer here https://github.com/huggingface/transformers/pull/36895#issuecomment-2815598690!

Cyrilvallez avatar Apr 18 '25 15:04 Cyrilvallez

@Cyrilvallez thanks for the fix, it seems to work for me!

konstantinos-p avatar May 11 '25 16:05 konstantinos-p

Hello @qubvel! I've addressed your initial comments and added tests that pass, plus a first draft of the docs, etc. I think it would be a good time to have a second look! The CircleCI tests were passing as well, but are now failing after I rebased on recent changes. I'll need some help with these as well!

konstantinos-p avatar May 11 '25 17:05 konstantinos-p

Hello @qubvel! It would be great if we could make a final push and merge this! I'd rather work on it while the key points are still fresh.

DINO DETR is still very competitive and an important model in the DETR line, on top of which new architectures could be built.

konstantinos-p avatar Oct 05 '25 15:10 konstantinos-p

cc @nielsrogge @yonigozlan @molbap

Rocketknight1 avatar Oct 06 '25 15:10 Rocketknight1

Hey @konstantinos-p, I will be taking a look at your PR now! First of all, I'm seeing a few conflicts with main, mostly due to https://github.com/huggingface/transformers/pull/40760. Could you run

git fetch upstream
git merge upstream/main
pip install -e .[quality]

and, for the code_quality check, run make fixup once that's done. I will do a pass on your PR now, and another one when the conflicts are solved and the branch is up to date. Let me know if you encounter issues!

molbap avatar Oct 09 '25 08:10 molbap

@konstantinos-p can you merge main into your branch by the way please? :hugs: The base version is from May and this particular one was yanked :sweat_smile:

molbap avatar Oct 13 '25 09:10 molbap

@konstantinos-p can you merge main into your branch by the way please? 🤗 The base version is from May and this particular one was yanked 😅

Hello @molbap! Sounds good! I've been doing fetch/rebase on main. (I'm not really familiar with the finer differences between rebase and merge.) What is the difference compared to doing a merge? (Also, could you elaborate a bit on the yanked part? :) ) I've addressed the comments as far as I can tell and also did another pass over the modular/modeling code to add some comments/cleanup. I guess you can have a second look!

konstantinos-p avatar Oct 14 '25 06:10 konstantinos-p

Hey @konstantinos-p, are you sure you've rebased on origin:main? Merge or rebase makes no difference here, as we'll squash in the end, but the transformers version should be v5-dev, not 4.52 (and that one in particular was yanked, i.e. removed from PyPI because it had several critical issues).

taking another look now :)

molbap avatar Oct 15 '25 11:10 molbap

I'll have a look at the rebase! Btw, I tried using the SinePositionEmbedding from Deformable DETR, but unfortunately it doesn't exactly match, as far as I can tell, although I think I can use LearnedPositionEmbedding as is.

I've also removed a lot of the original config parameters. I did this by looking at the original repo and checking whether they change across the configs of the other model variants (I also used my own judgement). I think it might be possible to also remove two_stage_type and simplify things further. Then the only implementation would be the standard one, which is the default.

konstantinos-p avatar Oct 15 '25 15:10 konstantinos-p

Regarding the rebase: I see for example that my commit history on this branch includes

https://github.com/huggingface/transformers/commit/72a3fc275c6161fcb98664c81aae8038a76407ce

from two weeks ago, because I rebased then. At that time I ran

git fetch upstream
git rebase upstream/main

I also see that test_modeling_common.py keeps having conflicts even though I ran make style; I'll try to fix that. (It should be fixed by a rebase or by manually adding the longer lines.)

konstantinos-p avatar Oct 20 '25 07:10 konstantinos-p

To fix conflicts, I usually just do the following (after syncing my fork):

git checkout main
git pull 
git checkout my_branch_to_update
git merge main

then fix the conflicts that appear (test_modeling_common should pop up, among a few others).

I did that with your branch and got 3 conflicting files, mostly linked to our removal of support for TensorFlow and TorchScript, plus some changes to the Deformable DETR tests. You can just accept all incoming changes.

Once this is done, I'd suggest you run fixup once you're back on your branch:

pip install -e .[quality]
make fixup

that should help clear out some of the code quality issues. The linter has been updated, a few things have changed regarding typing (list, not List), etc, so doing that should clear up the code and the CI a bit :hugs:

molbap avatar Oct 21 '25 09:10 molbap

All tests are passing on my side! test_modeling_common just needs to be updated to the latest main changes before merging so that some lines have the correct length.

konstantinos-p avatar Oct 25 '25 10:10 konstantinos-p

Hello @molbap! Happy to apply any changes in the upcoming days if you have time for a review! It would be great to capitalize on the momentum on this one!

konstantinos-p avatar Oct 30 '25 17:10 konstantinos-p

Hey @konstantinos-p, thanks for your patience! reviewing today :)

molbap avatar Oct 31 '25 14:10 molbap

Nice for the tests! Left a review to align with the new, leaner API for tuples, attentions and hidden states

Sounds good! I'll get right on these :)

konstantinos-p avatar Nov 06 '25 16:11 konstantinos-p

Thanks a lot! Ping me if you encounter issues :hugs:

molbap avatar Nov 06 '25 21:11 molbap

Hello @molbap! I can't figure out how _can_record_outputs works.

  • I added this to the Model and Object Detection model classes like so
    _can_record_outputs = {
        "encoder_self_attentions": OutputRecorder(DinoDetrEncoderLayer, layer_name="self_attn", index=1),
        "decoder_self_attentions": OutputRecorder(DinoDetrDecoderLayer, layer_name="self_attn", index=1),
        "decoder_cross_attentions": OutputRecorder(DinoDetrDecoderLayer, layer_name="cross_attn", index=1),
    }
  • I also added **kwargs: Unpack[TransformersKwargs], to most forward functions of the different modules.

I then tried calling the model forward with output_encoder_self_attentions=True and some variants of this, but the output seems unchanged. On a side note, I also couldn't find any docs or tests that exercise this functionality; I could only find references to the old approach. Edit: I think I found it; the layer name was wrong.

konstantinos-p avatar Nov 16 '25 18:11 konstantinos-p

Hey @konstantinos-p, we are currently patching 4.57 and also doing a lot of work to prepare for v5, so apologies, bandwidth is limited :bow: Did you manage to solve your _can_record_outputs problem?

molbap avatar Nov 20 '25 16:11 molbap

Hello @molbap! No worries, yes solved it! Currently addressing some last comments. I should be done tomorrow morning and will ping you for another look. (Currently tests are failing but that should be solved by a simple rebase)

konstantinos-p avatar Nov 20 '25 17:11 konstantinos-p

Hey @molbap! Short update on this: I finished up with the comments last week and rebased, then a bunch of stuff broke. I think I fixed most of the failing tests but test_can_use_safetensors and test_load_save_without_tied_weights keep failing. I'm still working on this but any tips would be appreciated :)

I also see

     if safe_serialization:
         # TODO: fix safe_serialization for tied weights
         # Safetensors does not allow tensor aliasing.
         # We're going to remove aliases before saving
         ptrs = collections.defaultdict(list)

inside save_pretrained, so maybe this is just not working at the moment?

konstantinos-p avatar Nov 24 '25 11:11 konstantinos-p

@Cyrilvallez I'm getting some weird behavior with the recent changes to tying and loading weights. Specifically, save_pretrained with tied weights saves only the last weight instance in each tied-weight group; then, in test_can_use_safetensors, when loading the tied weights back from a saved checkpoint, everything apart from the last weight is (wrongly) identified as missing. The final assertion in test_can_use_safetensors fails as well. However, I think that the weights are loaded correctly.

    def test_can_use_safetensors(self):
        for model_class in self.all_model_classes:
            config, _ = self.model_tester.prepare_config_and_inputs_for_common()
            model_tied = model_class(config)
            with tempfile.TemporaryDirectory() as d:
                try:
                    model_tied.save_pretrained(d, safe_serialization=True)
                except Exception as e:
                    raise Exception(f"Class {model_class.__name__} cannot be saved using safetensors: {e}")
                with self.subTest(model_class):
                    model_reloaded, infos = model_class.from_pretrained(d, output_loading_info=True)
                    # Checking the state dicts are correct
                    reloaded_state = model_reloaded.state_dict()
                    for k, v in model_tied.state_dict().items():
                        with self.subTest(f"{model_class.__name__}.{k}"):
                            torch.testing.assert_close(
                                v,
                                reloaded_state[k],
                                msg=lambda x: f"{model_class.__name__}: Tensor {k}: {x}.\n{v}\nvs\n{reloaded_state[k]}\n"
                                "This probably means that it was not set with the correct value when tying.",
                            )

                    # Checking the tensor sharing are correct on the new model (weights are properly tied in both cases)
                    ptrs = defaultdict(list)
                    for k, v in model_tied.state_dict().items():
                        ptrs[v.data_ptr()].append(k)

                    shared_ptrs = {k: v for k, v in ptrs.items() if len(v) > 1}

                    for shared_names in shared_ptrs.values():
                        reloaded_ptrs = {reloaded_state[k].data_ptr() for k in shared_names}
                        self.assertEqual(
                            len(reloaded_ptrs),
                            1,
                            f"The shared pointers are incorrect, found different pointers for keys {shared_names}. `__init__` and `from_pretrained` end up not tying the weights the same way.",
                        )

                    # Checking there was no complain of missing weights
>                   self.assertEqual(
                        infos["missing_keys"],
                        set(),
                        "These keys were removed when serializing, and were not properly loaded by `from_pretrained`.",
                    )
E                   AssertionError: Items in the first set but not the second:
E                   'model.bbox_embed.1.layers.0.weight'
E                   'model.transformer.decoder.bbox_embed.0.layers.1.bias'
E                   'model.bbox_embed.0.layers.0.weight'
E                   'model.transformer.decoder.bbox_embed.0.layers.2.weight'
E                   'model.transformer.decoder.bbox_embed.0.layers.2.bias'
E                   'model.bbox_embed.1.layers.1.bias'
E                   'model.bbox_embed.0.layers.1.weight'
E                   'model.bbox_embed.0.layers.2.weight'
E                   'model.bbox_embed.1.layers.2.weight'
E                   'model.transformer.decoder.bbox_embed.0.layers.0.weight'
E                   'model.bbox_embed.0.layers.1.bias'
E                   'model.bbox_embed.0.layers.0.bias'
E                   'model.bbox_embed.1.layers.0.bias'
E                   'model.class_embed.0.bias'
E                   'model.transformer.decoder.class_embed.0.weight'
E                   'model.class_embed.0.weight'
E                   'model.bbox_embed.1.layers.1.weight'
E                   'model.transformer.decoder.bbox_embed.0.layers.0.bias'
E                   'model.bbox_embed.1.layers.2.bias'
E                   'model.transformer.decoder.bbox_embed.0.layers.1.weight'
E                   'model.transformer.decoder.class_embed.0.bias'
E                   'model.class_embed.1.weight'
E                   'model.class_embed.1.bias'
E                   'model.bbox_embed.0.layers.2.bias' : These keys were removed when serializing, and were not properly loaded by `from_pretrained`.

For example, in the list above, model.transformer.decoder.bbox_embed.1.layers.1.weight was saved and is not reported as missing.

I'm using the following patterns for specifying tied weights

r"(?:model\.)?(?:transformer\.decoder\.)?bbox_embed\.\d+\.layers.0": r"(?:model\.)?(?:transformer\.decoder\.)?bbox_embed.\d+\.layers.0",
 r"(?:model\.)?(?:transformer\.decoder\.)?bbox_embed\.\d+\.layers.1": r"(?:model\.)?(?:transformer\.decoder\.)?bbox_embed.\d+\.layers.1",
 r"(?:model\.)?(?:transformer\.decoder\.)?bbox_embed\.\d+\.layers.2": r"(?:model\.)?(?:transformer\.decoder\.)?bbox_embed.\d+\.layers.2",
r"(?:model\.)?(?:transformer\.decoder\.)?class_embed\.\d+": r"(?:model\.)?(?:transformer\.decoder\.)?class_embed.\d+",

Am I missing something?

konstantinos-p avatar Nov 24 '25 14:11 konstantinos-p

Sorry about that! Could you rebase/merge with latest main and try again? Tied weights were fixed earlier today!

Cyrilvallez avatar Dec 01 '25 18:12 Cyrilvallez

@molbap Everything works on my side! I addressed most of the previous comments! @Cyrilvallez I rebased and tried again, but I was still having problems with the tied weights. I ended up just removing the tied weights from this model, as they were not really necessary and had caused me problems before. In this model, the tied weights seem to come from the original authors wanting the option of having separate prediction heads per decoder layer for ablations. In the final model, though, they use the same prediction head, which can be implemented without tied weights.

konstantinos-p avatar Dec 03 '25 13:12 konstantinos-p

If weights don't need to be tied, we also need to remove all associated comments/methods!

Cyrilvallez avatar Dec 08 '25 12:12 Cyrilvallez

Hey! Here are a few comments! Also, it looks like the processor etc. were not added to the auto mappings! Finally, it would be much better to break from the load_backbone legacy here when adding the backbone into the model. Instead, just use AutoModel with the associated config, or, if the backbone is timm-only, it can directly be TimmWrapper, for example.

Hey @Cyrilvallez ! I think I've addressed most of your comments! The only ones that are not very clear to me are:

  • How to fix the auto mappings; where should I apply the changes?
  • How to use AutoModel or TimmWrapper. Is there an example where these are used? (By default I'm using the timm ResNet, but there are a couple of checks about using another timm model as well as an HF backbone.) A rough sketch of my guess is below.
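
Assuming AutoModel resolves timm hub checkpoints to TimmWrapperModel (the checkpoint name below is just an illustration, not what the PR currently uses), the rough sketch would be:

from transformers import AutoModel

# Load a timm backbone through the TimmWrapper integration instead of the
# legacy load_backbone path.
backbone = AutoModel.from_pretrained("timm/resnet50.a1_in1k")
print(type(backbone).__name__)  # expected: TimmWrapperModel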

konstantinos-p avatar Dec 09 '25 15:12 konstantinos-p