
Why is only dino_vits8 supported?

Open juancamilog opened this issue 1 year ago • 5 comments

In the examples, if you change the model_type to anything other than dino_vits8, the code crashes because of an assert in ViTExtractor.extract_saliency_maps. What needs to change to properly support other model types?

juancamilog avatar Mar 09 '23 02:03 juancamilog

+1 This is weird, hope the authors can give us an answer. Thanks

RickyYXY avatar Jun 04 '23 16:06 RickyYXY

@juancamilog I read the code again and found that this may be because the authors only tuned head_idxs = [0, 2, 4, 5] for vits. Since vit-b/l/g have different numbers of heads, suitable head_idxs would need to be found for them as well. But that's not a simple task...
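As a quick sanity check before experimenting, one could at least validate that the chosen indices exist for a given model. This is a hypothetical helper (not part of the repo); the head counts listed are those of the official DINO releases (ViT-S variants have 6 heads, ViT-B variants have 12):

```python
# Head counts of the official DINO checkpoints (ViT-S: 6 heads, ViT-B: 12 heads).
DINO_NUM_HEADS = {
    "dino_vits8": 6, "dino_vits16": 6,
    "dino_vitb8": 12, "dino_vitb16": 12,
}

def validate_head_idxs(model_type, head_idxs=(0, 2, 4, 5)):
    """Check that every chosen head index exists for the given model type."""
    num_heads = DINO_NUM_HEADS[model_type]
    bad = [h for h in head_idxs if h >= num_heads]
    if bad:
        raise ValueError(
            f"{model_type} has {num_heads} heads; invalid head_idxs: {bad}"
        )
    return list(head_idxs)
```

Note this only catches out-of-range indices; as discussed below, indices that are valid for one model are not guaranteed to pick out the "foreground" heads of another.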

RickyYXY avatar Jun 05 '23 03:06 RickyYXY

Hi!

The saliency maps used in the co-segmentation, part co-segmentation, and correspondences examples are obtained by aggregating heads 0, 2, 4, 5 of dino_vits8. We removed heads 1 and 3 as they empirically attended to background areas. It is also possible to change the code to use a different DINO ViT and aggregate all heads, but this would require adjusting some of the hyperparameters in each application.
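The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: it assumes you already have the last block's attention tensor (which the real extractor obtains via a forward hook) and averages the [CLS]-to-patch attention over the selected heads:

```python
import numpy as np

def cls_attention_saliency(attn, head_idxs=(0, 2, 4, 5)):
    """Build a saliency map from [CLS]->patch attention of selected heads.

    attn: array of shape [batch, num_heads, num_tokens, num_tokens],
          the self-attention of the last transformer block (token 0 = [CLS]).
    Returns a [batch, num_patches] map, min-max normalized per image.
    """
    # [CLS] row of the attention matrix, restricted to patch tokens (1:).
    cls_attn = attn[:, list(head_idxs), 0, 1:]   # [batch, n_heads_sel, num_patches]
    saliency = cls_attn.mean(axis=1)             # average over the chosen heads
    # Normalize each image's map to [0, 1].
    mins = saliency.min(axis=1, keepdims=True)
    maxs = saliency.max(axis=1, keepdims=True)
    return (saliency - mins) / (maxs - mins + 1e-8)
```

For dino_vits8 on a 32x32 input (16 patches plus the [CLS] token), `attn` would have shape [batch, 6, 17, 17]; the hyperparameter that changes per model is which `head_idxs` actually attend to the foreground.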

LMK if you have further questions 🙏

ShirAmir avatar Jun 10 '23 21:06 ShirAmir

Thanks for your reply. When using ViT models with more heads, can the head indices 0, 2, 4, 5 stay the same? I'm not sure the larger models' heads will produce similar attention results.

By the way, I have tried using DINOv2's weights, but found the results are even worse. It seems that the patch_size significantly influences the foreground segmentation results: the smaller the patch_size, the better the result. Do you have any ideas about using DINOv2's features? I only found the DINO_vit14 pretrained model in their repo.

RickyYXY avatar Jun 18 '23 16:06 RickyYXY

@RickyYXY did you get it working for vitb?

krishnaadithya avatar Aug 01 '23 17:08 krishnaadithya