Add ViTamin models
Add the ViTamin model, which is trained on the public DataComp-1B dataset using the OpenCLIP framework and reaches 82.9% zero-shot ImageNet-1K accuracy with 436M parameters. It achieves state-of-the-art performance on zero-shot image classification, multi-modal retrieval, open-vocabulary detection and segmentation, and large multimodal models.
The ViTamin model code is adapted from vision_transformer_hybrid.py in the timm codebase.
This ViTamin work has been accepted to CVPR 2024 (https://arxiv.org/pdf/2404.02132).
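For context, a minimal usage sketch of what instantiating a ViTamin image tower through timm's standard factory might look like; the model name `vitamin_large_256` is an assumption for illustration, not a confirmed registered name:

```python
import timm
import torch

# Hypothetical model name; check timm.list_models('vitamin*') for the real ones.
model = timm.create_model('vitamin_large_256', pretrained=False, num_classes=0)
model.eval()

# Standard timm forward interface: num_classes=0 returns pooled features.
x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    features = model(x)
print(features.shape)
```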
@Beckschen thanks, probably a few more changes before the tests pass; if you get stuck I can help in a few days. For starters, on the current failure: the dataclass init needs to use the default factory pattern, as here: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/maxxvit.py#L137
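A minimal sketch of the default-factory pattern being referenced, since Python dataclasses reject mutable defaults at class-definition time; the config class and field names below are hypothetical, for illustration only:

```python
from dataclasses import dataclass, field
from typing import Tuple

# Broken: a mutable default raises at class definition time:
#   ValueError: mutable default <class 'list'> for field depths is not allowed
# @dataclass
# class VitaminCfg:
#     depths: list = [2, 4, 14]

# Fixed: wrap mutable defaults in default_factory (immutable defaults are fine as-is).
@dataclass
class VitaminCfg:  # hypothetical name, not the actual ViTamin config
    embed_dims: Tuple[int, ...] = (64, 128, 256)
    depths: list = field(default_factory=lambda: [2, 4, 14])
    act_layer: str = 'gelu'

cfg = VitaminCfg()
cfg.depths.append(2)                       # mutating one instance...
assert VitaminCfg().depths == [2, 4, 14]   # ...does not leak into new instances
```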
Thanks very much, Ross @rwightman! I've fixed the issue with the dataclass initialization. Could you please review it before proceeding with the merge? Thanks again!
@Beckschen this required more changes, so I've continued in another PR #2193 (which pulls these commits and adds my own), including an addition to the base vit model for xlarge (disabling the pos embed). I think it's working now but haven't done extensive checks... I can add support to OpenCLIP now fairly easily; it's easier to verify it's correct there.
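A rough sketch of what an optional (disableable) position embedding can look like in a ViT embedding stage; the class and argument names are illustrative assumptions, not timm's actual implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedStage(nn.Module):  # hypothetical module for illustration
    def __init__(self, num_patches: int, embed_dim: int, pos_embed: str = 'learn'):
        super().__init__()
        if pos_embed == 'learn':
            # Standard ViT: a learned additive position embedding.
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        elif pos_embed == 'none':
            # Disabled, e.g. when a conv stem already provides positional cues.
            self.pos_embed = None
        else:
            raise ValueError(f'unknown pos_embed mode: {pos_embed}')

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pos_embed is not None:
            x = x + self.pos_embed
        return x
```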
I'm truly grateful for your help, @rwightman! I saw there are changes regarding compatibility with vision_transformer.py and vision_transformer_hybrid.py. Thanks again!
This version is designed to support both timm and OpenCLIP. Thanks for merging the model configs into OpenCLIP.
Thanks again, @rwightman!
Best regards, Jieneng