pytorch-image-models
[Feature] Make (non `rmlp`) MaxxViT support variable resolution
Dear rwightman, thanks for your work. I tried to input a tensor of size (16, 3, 112, 112) to test the MaxxViT small 224 model, but it failed. Do you have any solutions?
@mywebinfo65536 Some of the maxvit models are currently resizable: any model with `rmlp` in the name will resize if you pass `img_size` to the model on creation, BUT the size must be divisible by 32, so 96 and 128 work, but not 112. This model can do that: maxvit_rmlp_tiny_rw_256
It is possible to add resize support for the tf and other maxvit models, but they require extra code to interpolate the position embeddings, which is a bit of work. Any help is appreciated here; the Tensorflow style pos embed and the Swin style would each require a different interpolation impl.
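To make the divisible-by-32 rule concrete, here is a minimal sketch that finds the closest usable sizes around a requested resolution. The helper name `nearest_valid_sizes` is hypothetical and not part of timm; it only encodes the constraint described above.

```python
from typing import Tuple


def nearest_valid_sizes(img_size: int, multiple: int = 32) -> Tuple[int, int]:
    """Return the closest multiples of `multiple` at or below, and at or above, img_size.

    Hypothetical helper for illustration only; it is not a timm function.
    """
    lower = (img_size // multiple) * multiple
    # If img_size is already a multiple, both bounds equal img_size.
    upper = lower if lower == img_size else lower + multiple
    return lower, upper


# 112 is not divisible by 32, so the usable neighbours are 96 and 128.
print(nearest_valid_sizes(112))
```

One of the returned values could then be passed as `img_size` when creating an `rmlp` model such as maxvit_rmlp_tiny_rw_256, with the input resized to match.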
Hey @rwightman, do you still want someone to take a look at this?
@xvr-hlt yeah, if someone wants to tackle the interpolation code (a bit different for both the TF style and swin style) that'd be great.
There's some code to base this on. The TF one can resize on the fly, the swin approach is done on pretrained weight load to a model initialized with different size.
https://github.com/microsoft/Swin-Transformer/blob/f92123a0035930d89cf53fcb8257199481c4428d/utils.py#L61
and
https://github.com/google-research/maxvit/blob/main/maxvit/models/maxvit.py#L198-L237
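For readers following along, the Swin-style approach linked above can be sketched roughly as follows: the relative position bias table is reshaped to a 2D grid per head, resized with bicubic interpolation, and flattened back. This is only an illustration under assumptions, not the code that landed in timm; the function name `resize_rel_pos_bias_table` and the `(num_offsets, num_heads)` table layout are assumed here, and the TF-style embedding would need a different routine.

```python
import torch
import torch.nn.functional as F


def resize_rel_pos_bias_table(table: torch.Tensor, new_window: int) -> torch.Tensor:
    """Resize a relative position bias table to a new window size.

    Assumes `table` has shape ((2 * W - 1) ** 2, num_heads), as in Swin.
    Hypothetical helper for illustration; not a timm or Swin API.
    """
    old_len, num_heads = table.shape
    old_side = int(round(old_len ** 0.5))
    if old_side * old_side != old_len:
        raise ValueError("table length must be a perfect square")
    new_side = 2 * new_window - 1
    if new_side == old_side:
        return table
    # (L, H) -> (1, H, S, S) so each head is treated as one image channel.
    grid = table.permute(1, 0).reshape(1, num_heads, old_side, old_side)
    grid = F.interpolate(grid, size=(new_side, new_side), mode="bicubic", align_corners=False)
    # (1, H, S', S') -> (L', H)
    return grid.reshape(num_heads, new_side * new_side).permute(1, 0).contiguous()


# Example: a table built for window size 7 (13 x 13 offsets), resized for window size 4.
pretrained_table = torch.randn(13 * 13, 4)
resized = resize_rel_pos_bias_table(pretrained_table, new_window=4)
print(resized.shape)
```

Doing this once when loading pretrained weights keeps the forward pass unchanged, which is the trade-off against resizing on the fly.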
@rwightman it might be a good time to move this convo into a WIP PR, but while we're here: did you imagine this was a solution specifically for the maxxvit models defined here, or something more general purpose? If it's the former, is there any way of knowing which config models are SWIN style, and which are tf? I'm assuming the tf models are anything that composes from the _tf_cfg dict, but what about the SWIN style models?
maxvit, coatnet, and swin models support resize on load now. It would be possible to support dynamic resize as well, but that would have a small runtime penalty and add complexity, so I'm sticking with resize on load.