vision icon indicating copy to clipboard operation
vision copied to clipboard

Deformable attention implementation

Open Isalia20 opened this issue 2 months ago • 8 comments

Deformable attention implementation. Fixes: https://github.com/pytorch/pytorch/issues/112827

Isalia20 avatar Nov 06 '25 22:11 Isalia20

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9260

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Nov 06 '25 22:11 pytorch-bot[bot]

@NicolasHug Can you please review this PR?

Isalia20 avatar Nov 24 '25 18:11 Isalia20

Hi @Isalia20 , thanks for the PR.

This is a massive one :)

Can you share more about why this is needed? I shared my thought and questions on that before inhttps://github.com/pytorch/pytorch/issues/112827#issuecomment-1804905345. It seems like there are really solid implementations of deformable attention already, so we'd like to understand the added-value of having this natively in torchvision.

NicolasHug avatar Nov 26 '25 11:11 NicolasHug

There are 2 main reasons why this would be a nice addition:

  1. The deformable attention kernels are definitely solid, but they are outdated. When installing they give deprecation warnings for the kernels. So it's a matter of time before torch removes some of the deprecated features and these kernels stop working. As I remember when implementing this PR, one of them was due to usage of .data<scalar_t>() for acccessing the data_ptr. In this PR that is replaced with the new data_ptr() so no warnings are given. See: https://github.com/Isalia20/vision/blob/cc76d33e073f9ecf7a8af695c31708663c6a7fac/torchvision/csrc/ops/cuda/deform_attn_kernel.cu#L1691 https://github.com/facebookresearch/dinov3/blob/54694f7627fd815f62a5dcc82944ffa6153bbb76/dinov3/eval/segmentation/models/utils/ops/src/cuda/ms_deform_attn_cuda.cu#L142C42-L142C59

  2. Another one is to have it directly in torchvision, so users don't need to additionally build the kernels themselves like it's done in the SOTA vision model like DINOv3: https://github.com/facebookresearch/dinov3/tree/main/dinov3/eval/segmentation/models/utils/ops/src/cuda

Isalia20 avatar Nov 26 '25 11:11 Isalia20

@NicolasHug Any updates on this?

Isalia20 avatar Dec 04 '25 20:12 Isalia20

@NicolasHug kind reminder for this. Would be great to get this in the 2.10 release

Isalia20 avatar Dec 09 '25 10:12 Isalia20

Hi @Isalia20 . I just want to be fully transparent to manage expectations: we won't be able to make this happen for 2.10. Reviewing this PR is a massive amount of work and the team bandwidth is low - especially for non-preproc/transforms stuff, which is what we're focusing on right now.

I am honestly still unsure of the value that having this in TV would bring. The implementation in mmvc seems to be maintained, so hopefully deprecation warnings aren't that big of a deal there?

NicolasHug avatar Dec 09 '25 10:12 NicolasHug

@NicolasHug Got it thanks. mmcv's latest commit on main was 8 months ago, which means that project isn't really maintained. If this PR is too big, I can chop it up in smaller pieces to make it manageable to review. I have to think on how to really structure it through multiple PRs but if that helps I can do it

Isalia20 avatar Dec 09 '25 10:12 Isalia20

@NicolasHug should I split it into multiple smaller ones to make it easier to review?

Isalia20 avatar Dec 11 '25 09:12 Isalia20