dino-vit-features
Why is only dino_vits8 supported?
In the examples, if you change the model_type to anything other than dino_vits8, the code crashes because of an assert in ViTExtractor.extract_saliency_maps. What needs to change to properly support other model types?
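A minimal sketch of the kind of change involved (illustrative NumPy, not the repo's actual code: `saliency_from_attention` and the toy tensor shapes are assumptions): instead of asserting the model type, validate the requested `head_idxs` against the model's actual head count (ViT-S has 6 heads, ViT-B has 12), then aggregate the chosen heads' CLS-token attention.

```python
import numpy as np

def saliency_from_attention(attn, head_idxs):
    """attn: [num_heads, num_tokens, num_tokens] self-attention of the last block.
    Averages the CLS-token attention of the chosen heads over the patch tokens."""
    num_heads = attn.shape[0]
    # Generalized check, replacing a hard-coded `assert model_type == "dino_vits8"`.
    assert all(0 <= h < num_heads for h in head_idxs), (
        f"head_idxs {head_idxs} out of range for a {num_heads}-head model")
    cls_attn = attn[head_idxs, 0, 1:]   # CLS -> patch attention, per selected head
    saliency = cls_attn.mean(axis=0)    # aggregate over the selected heads
    return saliency / saliency.max()    # normalize to [0, 1]

# Toy example: random "attention" for a 6-head ViT-S, 1 CLS + 64 patch tokens.
rng = np.random.default_rng(0)
attn = rng.random((6, 65, 65))
sal = saliency_from_attention(attn, head_idxs=[0, 2, 4, 5])
```

With a check like this, supporting another model is a matter of passing head indices that exist for that model; which indices give foreground-focused saliency is the empirical question discussed below.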
+1 This is weird; hope the authors can give us an answer. Thanks
@juancamilog I read the code again and found that this may be because the authors have only tried head_idxs = [0, 2, 4, 5] for ViT-S. Since ViT-B/L/G have different numbers of heads, suitable head_idxs would also need to be found for them. But that's not a simple question...
Hi!
The saliency maps used in the co-segmentation, part co-segmentation, and correspondence examples are acquired by aggregating heads 0, 2, 4, 5 of dino_vits8. We removed heads 1 and 3 because they empirically attended to background areas. It is also possible to change the code to use a different DINO ViT and aggregate all heads, but that would require adjusting some of the hyperparameters in each application.
LMK if you have further questions 🙏
Thanks for your reply. If we use ViT models with more heads, can the head indices 0, 2, 4, 5 stay the same? I'm not sure the larger models' heads will produce similar attention results.
By the way, I have tried using DINOv2's weights, but found that the result is even worse. It seems that the patch_size significantly influences the foreground segmentation results: the smaller the patch_size, the better the result. Do you have any ideas about using DINOv2's features? I only found ViT-*/14 pretrained models in their repo.
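Part of the patch-size effect described above is simply resolution: the saliency map has one value per patch token, so a smaller patch gives a denser map. A quick sketch (the 224x224 input size is an assumption; `patch_grid` is a hypothetical helper):

```python
def patch_grid(image_size, patch_size):
    """Patch tokens per side and in total for a square ViT input."""
    side = image_size // patch_size
    return side, side * side

# dino_vits8 (patch 8) vs. DINOv2's ViT-*/14 (patch 14) on a 224x224 crop:
print(patch_grid(224, 8))    # (28, 784) -> 28x28 saliency grid
print(patch_grid(224, 14))   # (16, 256) -> 16x16 saliency grid
```

So at the same input resolution, a patch-14 model yields roughly a third as many saliency values, which plausibly contributes to coarser foreground masks independently of feature quality.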
@RickyYXY did you get it working for vitb?