keras-nlp
Add BLIP model
Is your feature request related to a problem? Please describe.
BLIP: Bootstrapping Language-Image Pre-training (2022) is a model that can perform various multi-modal tasks, including
- Image Captioning
- Visual Question Answering
- Image-Text retrieval (Image-text matching)
(Cited by 247 as of now.)
Describe the solution you'd like
Info: I've gone through the source code of the official BLIP repo, specifically its image-captioning model, and found that most of the code is taken from huggingface-transformers and modified with their proposed changes. The NLP component of BLIP is, for the most part, BERT.
In short: since HF also provides TF-BERT, translating the code from there is straightforward, but doing it with KerasNLP-BERT might need extra care (see the sketch below).
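To make the "extra care" point concrete, here is a minimal sketch (not a port) of how BLIP's image-grounded text decoder could be approximated with KerasNLP building blocks. BLIP's decoder is BERT with causal self-attention plus cross-attention to the image tokens, which `keras_nlp.models.BertBackbone` does not expose, so a custom stack would be needed with the pretrained weights mapped into it. All layer counts, sizes, and names below are my own assumptions, not the official BLIP config.

```python
import keras_nlp
from tensorflow import keras

# Assumed sizes: BERT-base-like text side, ViT-B/16-like vision side.
VOCAB_SIZE = 30522
MAX_CAPTION_LEN = 40
HIDDEN_DIM = 768
NUM_LAYERS = 4   # BLIP uses 12; kept small for the sketch
NUM_HEADS = 12

# Inputs: caption token ids and patch features coming from a ViT encoder
# (a ViT-B/16 on 224x224 images yields 196 patch tokens of width 768).
token_ids = keras.Input(shape=(MAX_CAPTION_LEN,), dtype="int32", name="token_ids")
image_features = keras.Input(shape=(None, HIDDEN_DIM), name="image_features")

# Token + position embeddings for the caption.
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_CAPTION_LEN,
    embedding_dim=HIDDEN_DIM,
)(token_ids)

# BLIP's image-grounded text decoder is BERT with causal self-attention plus
# cross-attention to the image tokens; keras_nlp.layers.TransformerDecoder
# supports both, so a stack of them mirrors that structure.
for _ in range(NUM_LAYERS):
    x = keras_nlp.layers.TransformerDecoder(
        intermediate_dim=4 * HIDDEN_DIM,
        num_heads=NUM_HEADS,
    )(decoder_sequence=x, encoder_sequence=image_features)

# Language-model head over the vocabulary.
logits = keras.layers.Dense(VOCAB_SIZE, name="lm_head")(x)

caption_decoder = keras.Model([token_ids, image_features], logits, name="blip_style_decoder")
caption_decoder.summary()
```

The hard part would then be mapping BLIP's BERT weights from the official checkpoint into a stack like this, which is exactly where the KerasNLP route needs more care than the HF TF-BERT route.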
Describe alternatives you've considered
none.
Additional context
- For BLIP, the CV component is only a Vision Transformer used for feature extraction; KerasCV already provides ViT models. The larger part of the BLIP model consists of the NLP component (specifically BERT). A vision-side sketch follows after this list.
- BLIP 2 - HF-Blog
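Continuing the sketch above, the vision side only has to hand the decoder a sequence of patch tokens to cross-attend to. The patch embedder below is a plain-Keras stand-in for the pretrained ViT that KerasCV provides (I haven't pinned down KerasCV's exact ViT class names or output shapes, so treat every name here as an assumption); it only illustrates the shape contract: a 224x224 image with 16x16 patches gives 14x14 = 196 tokens of width 768 (the real ViT also prepends a class token).

```python
import keras_nlp
from tensorflow import keras

IMAGE_SIZE = 224
PATCH_SIZE = 16
HIDDEN_DIM = 768
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 196

# Stand-in ViT feature extractor: patchify with a strided Conv2D, add position
# embeddings, then a couple of Transformer encoder blocks. In the real port this
# whole model would be replaced by KerasCV's pretrained ViT.
images = keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3), name="images")
patches = keras.layers.Conv2D(
    HIDDEN_DIM, kernel_size=PATCH_SIZE, strides=PATCH_SIZE, name="patch_embedding"
)(images)
tokens = keras.layers.Reshape((NUM_PATCHES, HIDDEN_DIM))(patches)
tokens = tokens + keras_nlp.layers.PositionEmbedding(sequence_length=NUM_PATCHES)(tokens)
for _ in range(2):  # ViT-B/16 has 12 such blocks
    tokens = keras_nlp.layers.TransformerEncoder(
        intermediate_dim=4 * HIDDEN_DIM, num_heads=12
    )(tokens)

vision_encoder = keras.Model(images, tokens, name="vit_stand_in")

# Wiring with the decoder sketch above:
#   image_features = vision_encoder(image_batch)
#   logits = caption_decoder([caption_token_ids, image_features])
```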
Update
A TF version of BLIP has been added to Hugging Face Transformers ❤️ cc @Rocketknight1 (quick usage sketch below)
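For completeness, a quick (untested here) sketch of loading that TF port from transformers; it could serve as a numerical reference while porting to KerasNLP. The demo image URL is just an example.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, TFBlipForConditionalGeneration

# Salesforce's captioning checkpoint on the Hugging Face Hub.
checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
model = TFBlipForConditionalGeneration.from_pretrained(checkpoint)

# Any RGB image works; this is just a demo image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="tf")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```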