
Add ViTamin models

Open Beckschen opened this issue 1 year ago • 3 comments

Add the ViTamin model, which is trained on the public DataComp-1B dataset using the OpenCLIP framework and reaches 82.9% zero-shot ImageNet-1K accuracy with 436M parameters. It achieves state-of-the-art performance on zero-shot image classification, multi-modal retrieval, open-vocabulary detection and segmentation, and large multimodal models.

The code for the ViTamin models is adapted from vision_transformer_hybrid.py in the timm codebase.

This ViTamin work has been accepted to CVPR 2024 (https://arxiv.org/pdf/2404.02132).

Beckschen avatar May 05 '24 06:05 Beckschen

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Beckschen thanks, probably a few more changes before the tests pass; if you get stuck I can help in a few days. For starters, the current failure: the dataclass init needs to use the default-factory pattern as here: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/maxxvit.py#L137
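For context, the default-factory pattern referenced above looks like the following. This is a minimal sketch with hypothetical field names, not the actual ViTamin config: the point is that mutable defaults (lists, dicts, nested dataclasses) in a dataclass must be supplied via `field(default_factory=...)`, since a bare mutable literal raises an error or is shared across instances.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VitCfg:
    # Hypothetical config fields for illustration only.
    # Immutable defaults (ints, tuples) can be plain assignments.
    embed_dim: Tuple[int, ...] = (192, 384, 768)
    # Mutable defaults need a factory so each instance gets a fresh object.
    mlp_ratios: List[int] = field(default_factory=lambda: [4, 4, 4])


cfg = VitCfg()
```

Using `mlp_ratios: List[int] = [4, 4, 4]` directly would raise `ValueError: mutable default ... for field mlp_ratios is not allowed` at class-definition time, which matches the kind of CI failure described above.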

rwightman avatar May 05 '24 16:05 rwightman

Thanks very much, Ross @rwightman ! I've fixed the issue with the dataclass initialization. Could you please review it before proceeding with the merge? Thanks again!

Beckschen avatar May 14 '24 19:05 Beckschen

@Beckschen this required more changes so I've continued in another PR #2193 (which pulls these commits and adds my own), including an addition to the base vit model for xlarge (disable pos embed). I think it's working now but haven't done extensive checks... can add support to OpenCLIP now fairly easily, easier to verify it's correct there.
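The "disable pos embed" addition mentioned above can be sketched as follows. This is an illustrative toy module, not timm's actual implementation or API; names like `use_pos_embed` are assumptions. The idea is simply to make the learned positional embedding optional so a variant can skip it.

```python
import torch
import torch.nn as nn


class TokenStem(nn.Module):
    """Toy ViT token stem with an optional positional embedding (illustrative only)."""

    def __init__(self, num_tokens: int = 196, dim: int = 768, use_pos_embed: bool = True):
        super().__init__()
        # When disabled, no parameter is created and the addition is skipped.
        self.pos_embed = (
            nn.Parameter(torch.zeros(1, num_tokens, dim)) if use_pos_embed else None
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        if self.pos_embed is not None:
            x = x + self.pos_embed
        return x
```

With `use_pos_embed=False` the module is an identity over the tokens, which is the behavior a variant that disables positional embeddings would rely on.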

rwightman avatar Jun 04 '24 00:06 rwightman

I'm truly grateful for your help, @rwightman ! I saw there are changes regarding compatibility with vision_transformer.py and vision_transformer_hybrid.py. Thanks again!

The version is designed to support both timm and OpenCLIP. Thanks for merging the model configs in OpenCLIP.

Thanks again, @rwightman !

Best regards, Jieneng

Beckschen avatar Jun 07 '24 20:06 Beckschen