pytorch-image-models
[FEATURE] Add Hiera
Add a vision model from Meta
"Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles" https://github.com/facebookresearch/hiera/tree/main
@raulcarlomagno I like this model quite a bit, neat ideas, but they've marked both the code and weights as non-commercial. I can deal with the weights, as I treat them with separate licenses on the HF hub, but I cannot bring NC code into timm...
Given that, it takes more effort to do a clean-room impl from first principles, and I have a lot of things in progress right now. Or you could bug them to drop the NC license on the code and just keep it for the weights...
@rwightman @raulcarlomagno Hi, we've made the license for Hiera code Apache 2.0. (We cannot do anything about the model licenses unfortunately.) Would love to support integration into timm!
@chayryali that's great! I think it shouldn't be too hard to get it in, the style is pretty much in line with timm already ... just a number of timm-specific additions for the model builder, and some extra functionality. I'll have to take another look at the impl ...
The weight license will be handled with an appropriate license on the HF model hub, and also a comment / tag in the implementation where the pretrained weight links are.
@chayryali so, been juggling just a few things lately, but I do have this model working locally in timm.
I've been trying to add support for changing resolution though, either on init (a different input (img) size passed to the model) or on the fly in forward.
As soon as the resolution is changed the model accuracy drops off a cliff; I haven't had issues resizing vanilla vits and related models, or any of the window'd variants like swin, maxvit, etc ...
If I hold the patch stride vs img size ratio constant it appears to work, but that constrains the possibilities significantly...
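For reference, the resize that works fine for the vanilla vits is just a plain interpolation of the abs pos embed to the new token grid, roughly like this (sketch only, the names are illustrative and not the actual timm helpers):

```python
import torch
import torch.nn.functional as F

def resize_abs_pos_embed(pos_embed: torch.Tensor, old_hw: tuple, new_hw: tuple) -> torch.Tensor:
    """Bicubically resize a (1, H*W, C) absolute position embedding to a new token grid.

    This is the standard trick that holds up for vanilla vits; with hiera the
    interaction with window attention is what breaks it.
    """
    _, n, c = pos_embed.shape
    assert n == old_hw[0] * old_hw[1]
    pe = pos_embed.reshape(1, old_hw[0], old_hw[1], c).permute(0, 3, 1, 2)  # (1, C, H, W)
    pe = F.interpolate(pe, size=new_hw, mode='bicubic', align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], c)
```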
@rwightman Great to hear it's working locally!
Regarding changing the resolution, it turns out (paper) the drop in performance is due to the interaction between window attention and absolute positional encoding. It also affects ViT (but typically in detection settings e.g. ViTDet, where it's more common to use window attention).
The fix is really simple: we make the abs position embeddings "window-aware" by maintaining two position embeddings, a window embedding (e.g. 8x8) and a global embedding (e.g. 7x7). The global embedding is interpolated to 56x56 (for 224x224 res), the window embedding is tiled to 56x56, and the two are added together to form the final position encoding. We're actually about to release the corresponding "absolute win" image and video models soon.
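Roughly, the combination looks like this (a minimal sketch; shapes follow the 224px example above, and the module/parameter names are illustrative rather than the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAwarePosEmbed(nn.Module):
    """Sketch of a window-aware absolute position embedding.

    A coarse global embedding is interpolated to the full token grid and a
    window-sized embedding is tiled across it; their sum is the final pos embed.
    """
    def __init__(self, dim: int, grid_size=(56, 56), window_size=(8, 8), global_size=(7, 7)):
        super().__init__()
        self.grid_size = grid_size
        self.window_size = window_size
        self.pos_embed_win = nn.Parameter(torch.zeros(1, dim, *window_size))
        self.pos_embed_global = nn.Parameter(torch.zeros(1, dim, *global_size))

    def forward(self) -> torch.Tensor:
        # Interpolate the coarse global embedding up to the full grid (e.g. 7x7 -> 56x56).
        global_pe = F.interpolate(
            self.pos_embed_global, size=self.grid_size, mode='bicubic', align_corners=False)
        # Tile the window embedding across the grid (e.g. 8x8 tiled 7x7 times -> 56x56).
        reps = (self.grid_size[0] // self.window_size[0], self.grid_size[1] // self.window_size[1])
        win_pe = self.pos_embed_win.tile((1, 1, *reps))
        return global_pe + win_pe  # (1, dim, H, W); flatten / permute as needed downstream
```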
@chayryali nice, I hadn't seen that paper, will have a read. I was working through an idea to add different RoPE pos embeddings to the window'd and global stages to see if that'd work, but this appears simpler :)
Also, did a quick ablation while fiddling: instead of projecting for the residual shortcut, since it's a 2x expansion by default, avg + max pool seems to provide similar, if not slightly faster, learning progress comparing initial steps on a supervised learning task. Might have a different outlook for MAE pretrain though ... https://github.com/huggingface/pytorch-image-models/blob/d88bed653523e796ce325ce2213b0aeb6bed24c9/timm/models/hiera.py#L354-L364
Could make that view less redundant there, but just fiddling :)
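Roughly what I mean (rough sketch, assumes dim_out == 2 * dim and the same token grouping as the linked view):

```python
import torch

def pooled_shortcut(x_norm: torch.Tensor, q_stride: int) -> torch.Tensor:
    """Parameter-free residual shortcut for a 2x channel expansion.

    x_norm: (B, N, C) tokens, grouped in runs of q_stride tokens that get pooled
    together when downsampling. Instead of a learned Linear(C, 2C) projection,
    concatenate the avg- and max-pool over each group to get 2C channels.
    """
    b, n, c = x_norm.shape
    x = x_norm.view(b, q_stride, -1, c)       # same grouping as the linked block
    x_avg = x.mean(dim=1)                      # (B, N // q_stride, C)
    x_max = x.max(dim=1).values                # (B, N // q_stride, C)
    return torch.cat([x_avg, x_max], dim=-1)   # (B, N // q_stride, 2C)
```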
@chayryali read the paper, makes sense. Are the updated code/models coming anytime soon?
In the comparison tables you have numbers for fine-tuning at higher res. Definitely want to see those increases, but even just validating the same model at a higher res, if everything is working well you should see the same or improved (train-test discrepancy) val numbers when you increase up to 20-30% or so above the original res before it drops off (at which point fine-tuning is needed). That appears to hold for most other vit / vit-hybrid archs with heavy augmentation during pretraining. MAE might have a different impact there.
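e.g. the kind of quick check I mean (sketch only; the model name and img_size override here are illustrative, since hiera isn't in a timm release yet):

```python
import timm

# Same pretrained weights, evaluated at the original and a ~25% larger input size;
# top-1 should hold or improve if resolution changes are handled well.
for img_size in (224, 280):
    model = timm.create_model('vit_base_patch16_224', pretrained=True, img_size=img_size).eval()
    # ... run the usual ImageNet val loop with images resized / center-cropped to img_size ...
```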