Add vision-language models

Open trawler0 opened this issue 1 year ago • 1 comment

🚀 The feature

Add support for vision-language models like CLIP or LiT.

Motivation, pitch

Dear torchvision team, I am sorry if I missed earlier discussions about this, or a specific reason why you have chosen not to implement vision-language models. The current trend in computer vision is drifting heavily towards vision-language models like CLIP. It might be worth considering support for at least some of these models.

Alternatives

No response

Additional context

No response

May 20 '24 19:05 trawler0

Hi @trawler0, thanks for the feature request. We certainly acknowledge the prevalence of vision-language models, but at this time we're not prioritizing the addition of new models in torchvision and are instead focusing on the lower parts of the stack, like preprocessing.

May 24 '24 12:05 NicolasHug