key shape torch.Size([400, 1, 256]) does not match value shape torch.Size([1, 1, 256])

Open EugeoSynthesisThirtyTwo opened this issue 4 months ago • 0 comments

Describe the bug

I followed the README to train HGNetv2 on the COCO dataset: I downloaded COCO, modified the .yml files as instructed, and ran the train command.

It immediately raises a PyTorch error:

  File "/home/gabriel/Documents/D-FINE/src/zoo/dfine/hybrid_encoder.py", line 279, in forward
    src, _ = self.self_attn(q, k, value=src, attn_mask=src_mask)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gabriel/miniconda3/envs/dfine/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gabriel/miniconda3/envs/dfine/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
           ^^^^^^^
  File "/home/gabriel/miniconda3/envs/dfine/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1827, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gabriel/miniconda3/envs/dfine/lib/python3.11/site-packages/torch/nn/modules/activation.py", line 1380, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gabriel/miniconda3/envs/dfine/lib/python3.11/site-packages/torch/nn/functional.py", line 6293, in multi_head_attention_forward
    assert key.shape == value.shape, (
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: key shape torch.Size([400, 1, 256]) does not match value shape torch.Size([1, 1, 256])
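The assertion above can be reproduced in isolation. The following is a minimal sketch, not D-FINE code: it assumes the query and key are built as `src + pos_embed` (suggested by the `self.self_attn(q, k, value=src, ...)` call in the traceback), that `src` is a flattened 1x1 feature map, and that the positional embedding covers a 20x20 grid (400 positions). The tensor sizes and the `batch_first=True` layout are assumptions chosen to match the shapes in the error message.

```python
import torch
import torch.nn as nn

# Stand-in attention layer; embed_dim matches the 256 in the error message.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

# Assumed shapes: (batch, H*W, channels).
src = torch.zeros(1, 1, 256)          # flattened 1x1 feature map
pos_embed = torch.zeros(1, 400, 256)  # embedding for an assumed 20x20 grid

# Broadcasting silently expands the sum to 400 positions, so q and k
# no longer match the value tensor.
q = k = src + pos_embed

try:
    attn(q, k, value=src)
    message = None
except AssertionError as exc:
    message = str(exc)

print(tuple(k.shape), tuple(src.shape))
print(message)
```

If this matches what happens in the encoder, the 400 comes from the positional embedding rather than from the image, and the real question is why the feature map has only one spatial position.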

Tracing the tensor back, the tensor of shape [1, 1, 256] appears to originate right after the backbone (HGNetv2) in the DFINE.forward method:

import torch.nn as nn


class DFINE(nn.Module):
    __inject__ = [
        "backbone",
        "encoder",
        "decoder",
    ]

    def __init__(
        self,
        backbone: nn.Module,
        encoder: nn.Module,
        decoder: nn.Module,
    ):
        super().__init__()
        self.backbone = backbone
        self.decoder = decoder
        self.encoder = encoder

    def forward(self, x, targets=None):
        x = self.backbone(x)  # <-- Here x becomes the wrong shape
        x = self.encoder(x)
        x = self.decoder(x, targets)
        return x
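One way to confirm where the shape goes wrong is to record output shapes with a forward hook. This is a sketch under assumptions: `ToyBackbone` and `record_output_shapes` are hypothetical names, and the single stride-32 convolution only stands in for a real backbone's total downsampling. In practice the hook would be attached to `model.backbone` instead.

```python
import torch
import torch.nn as nn


def record_output_shapes(module, store, name):
    """Attach a forward hook that saves the shapes of the module's outputs."""
    def hook(_module, _inputs, output):
        outs = output if isinstance(output, (list, tuple)) else [output]
        store[name] = [tuple(o.shape) for o in outs if torch.is_tensor(o)]
    return module.register_forward_hook(hook)


class ToyBackbone(nn.Module):
    """Hypothetical stand-in: one conv with an overall stride of 32."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=32, padding=1)

    def forward(self, x):
        return [self.conv(x)]


shapes = {}
backbone = ToyBackbone()
handle = record_output_shapes(backbone, shapes, "backbone")

with torch.no_grad():
    backbone(torch.zeros(1, 3, 640, 640))
    big = shapes["backbone"]    # 640 / 32 = 20 positions per side
    backbone(torch.zeros(1, 3, 32, 32))
    small = shapes["backbone"]  # 32 / 32 = 1 position per side

handle.remove()
print(big, small)
```

With a stride of 32, a 640x640 input gives a 20x20 map (400 positions) and a 32x32 input gives a 1x1 map, so checking the input tensor's size at the point where the backbone is called may be a useful first step.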

I am at a loss as to what is going wrong. The README is really hard to follow for a newcomer, and I don't know how this is supposed to work.

I tried with the .yml file of D-FINE-S trained on COCO + Objects365.

By the way, I wanted to fine-tune D-FINE-S, not HGNetv2. How can I fine-tune the model I downloaded? How can I be sure it fine-tunes the right model, trained on the right datasets?
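For the second question, a generic PyTorch check (not specific to D-FINE) is to load the weights with `strict=False` and look at which keys did not match. This is a sketch with toy modules: `model` and `pretrained` are hypothetical stand-ins for the network being trained and the downloaded checkpoint, and a real checkpoint file may nest its weights under a key that has to be unpacked first.

```python
import torch.nn as nn

# Hypothetical stand-ins: the network to fine-tune and a "checkpoint"
# that only covers its first layer.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
pretrained = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
state = pretrained.state_dict()

# strict=False loads every matching key and reports the rest.
result = model.load_state_dict(state, strict=False)
missing = sorted(result.missing_keys)        # in the model, not in the checkpoint
unexpected = sorted(result.unexpected_keys)  # in the checkpoint, not in the model

print("missing:", missing)
print("unexpected:", unexpected)
```

A long list of missing or unexpected keys would suggest that the checkpoint and the config describe different architectures. Note that `strict=False` still raises an error when a key matches but its tensor shape differs.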

Thank you.

Desktop (please complete the following information):

  • OS: Ubuntu 24.04
  • Version: Just the latest commit on master

EugeoSynthesisThirtyTwo, Aug 20 '25 17:08