Add multi-modal method(s)
Hello, thanks for this amazing repo, it has been very useful for me. I wanted to ask if there is interest in implementing methods like CLIP for image-language pretraining. I understand that this might not be your main focus and that web-scale pretraining might be out of reach. However, the paper https://arxiv.org/abs/2305.08675 shows that one can actually get relatively high zero-shot accuracies with effort roughly equal to ImageNet pretraining. For context, the core of CLIP boils down to a symmetric contrastive loss over paired image/text embeddings, roughly as sketched below.
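This is only a rough illustration of the objective, not taken from any particular implementation; the temperature value and embedding shapes are placeholders:

```python
# Rough sketch of a CLIP-style symmetric contrastive objective.
import torch
import torch.nn.functional as F


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings (B, D)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching image/text pairs lie on the diagonal; cross-entropy in both directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```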
Hi! Multi-modal is definitely something we would like to incorporate. There are two main components missing for this: data loading for text, and NLP models/tokenizers. In both cases we have to decide how to support them. This was quite easy for vision because data loading is pretty standardized and the models are in torchvision; for text the landscape is more diverse, so we'll have to compare the available libraries first. Please let us know if you have any suggestions or input!
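To make the text data loading question a bit more concrete, here is a minimal sketch of one option, assuming a HuggingFace `transformers` tokenizer (just an illustration, not a decision): captions are tokenized in the collate function so the vision side of the pipeline stays unchanged. The dataset format, model name, and `max_length` are placeholders.

```python
# Hypothetical sketch: image-caption pairs batched with a HuggingFace tokenizer.
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
from transformers import AutoTokenizer  # assumption: HuggingFace transformers


class ImageCaptionDataset(Dataset):
    """Yields (transformed image tensor, raw caption string) pairs."""

    def __init__(self, samples, transform):
        # samples: list of (image_path, caption) tuples -- placeholder format
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, caption = self.samples[idx]
        image = self.transform(Image.open(path).convert("RGB"))
        return image, caption


def make_collate_fn(tokenizer, max_length=77):
    # Tokenize captions per batch; images are stacked as usual.
    def collate(batch):
        images, captions = zip(*batch)
        text = tokenizer(
            list(captions),
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )
        return torch.stack(images), text  # text holds input_ids / attention_mask
    return collate


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = ImageCaptionDataset(samples=[], transform=transform)  # fill with real pairs
loader = DataLoader(dataset, batch_size=256, collate_fn=make_collate_fn(tokenizer))
```

The nice part of this layout is that the tokenizer choice stays isolated in the collate function, so whichever NLP library we end up comparing and picking would only touch that one place.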