Add vision-language models
🚀 The feature
Add support for vision-language models like CLIP or LiT.
Motivation, pitch
Dear torchvision team, I am sorry if I missed discussions about this or a specific reason why you have chosen not to implement vision-language models. Computer vision is currently trending heavily towards vision-language models like CLIP. It might be worth considering adding support for at least some of these models.
Alternatives
No response
Additional context
No response
Hi @trawler0, thanks for the feature request. We certainly acknowledge the prevalence of vision-language models, but at this time we're not prioritizing the addition of new models in torchvision and are instead focusing on the lower parts of the stack, like preprocessing.