pytorch-image-models

[FEATURE] Add image backbones from `MobileCLIP` paper

Open · rsomani95 opened this issue · 3 comments

MobileCLIP is a very fast CLIP architecture for mobile inference - roughly 3x faster than the fastest publicly available CLIP backbone, `convnext_base_w`, for inference on iOS / macOS devices.

They introduce 3 novel image backbones: mci{0|1|2}. It would be amazing if these models were available directly via timm. I believe this would be an essential first step towards getting it into open_clip for fine-tuning.

The arch, defined here, uses MobileOne and FastViT components, which are already available in timm. I'm not sure how compatible that re-implementation is with the existing one in timm out of the box, but integration looks feasible.

rsomani95 commented on Mar 16, 2024

@rsomani95 the components themselves are equivalent at a functional level, but the parameter naming was remapped, so the checkpoint keys would have to be remapped for this model as well...
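A minimal sketch of what such a checkpoint key remapping might look like. The prefix pairs below are purely illustrative placeholders, not the actual mapping between the MobileCLIP reference implementation and timm:

```python
# Hypothetical sketch of checkpoint key remapping: the reference
# implementation and timm name equivalent MobileOne/FastViT parameters
# differently, so keys must be renamed before load_state_dict().
# NOTE: these prefix pairs are made up for illustration only.
REMAP_PREFIXES = [
    ("image_encoder.model.", ""),   # strip a CLIP wrapper prefix
    ("patch_embed.", "stem."),      # e.g. stem naming difference
    ("network.", "stages."),        # e.g. stage naming difference
]

def remap_state_dict(state_dict):
    """Rename checkpoint keys by applying ordered prefix substitutions."""
    out = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in REMAP_PREFIXES:
            if new_key.startswith(old):
                new_key = new + new_key[len(old):]
        out[new_key] = value
    return out

# Example on dummy keys (values would normally be tensors):
ckpt = {
    "image_encoder.model.patch_embed.0.weight": 1,
    "image_encoder.model.network.0.0.weight": 2,
}
print(remap_state_dict(ckpt))
# → {'stem.0.weight': 1, 'stages.0.0.weight': 2}
```

The real mapping in timm is more involved (per-layer renames, not just prefixes), but the shape of the solution is the same: transform keys, then load into the target model.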

rwightman commented on Mar 18, 2024

@rsomani95 I took a closer look at this. s1/s2 (mci1/mci2) are the easiest; those could probably map to OpenCLIP w/ a timm FastViT encoder (after a few additions and a key remapping for the weights). I think the text encoder for those is compatible.

s0 uses a RepMixer-based text encoder, so it would need new code in OpenCLIP as well. The image encoder would map to a tweaked version of FastViT.

The B model uses a ViT w/ a different stem, which is doable. I really like that ViT does NOT have BatchNorm, though, so it's a shame this one is a ViT-Base w/ BN in the stem.

rwightman commented on Mar 21, 2024

@rwightman thanks for looking into that. That's really great to hear re s1/s2, as those, in my eyes, sit in the perfect sweet spot of speed and accuracy. Given your observations, maybe it makes sense to port those two alone first? Is there something in particular I could help with?

rsomani95 commented on Mar 21, 2024

@rwightman Apple just released timm and OpenCLIP checkpoints: https://huggingface.co/collections/apple/mobileclip-models-datacompdr-data-665789776e1aa2b59f35f7c8

rsomani95 commented on Jun 14, 2024

@rsomani95 yup, I was coordinating with them to set it up. timm and OpenCLIP are already pointing at those checkpoints.

rwightman commented on Jun 14, 2024

Also worth pointing out: timm supports all of the models, including s0, since timm only needs the image tower. OpenCLIP isn't supporting s0 because supporting the RepMixer-based text tower for just that one model is too much extra work. The other models have a standard text tower.

rwightman commented on Jun 14, 2024

Awesome. Excited to use these! Thanks for helping out with that.

rsomani95 commented on Jun 14, 2024