Add Prismatic VLMs to Transformers
Model description
Hi! I'm the author of "Prismatic VLMs", our upcoming ICML paper that introduces and ablates design choices for visually-conditioned language models similar to LLaVa or InstructBLIP, contributing ~50 new VLMs at the 3B/7B/13B scale that are trained with:
- Different Visual Representations (`CLIP`, `SigLIP`, `DINOv2`, fusions thereof like `SigLIP + DINOv2`)
- Different LLM Backbones (`LLaMa2`, `Vicuña v1.5`, `Mistral v0.1`, `Mistral v0.1 Instruct`, `Phi-2`, etc.)
- Different Data (e.g., the LLaVa v1.5 Data, LVIS-Instruct-4V, and more upcoming!)
Our best models outperform LLaVa v1.5 given the same data/same scale on a wide spectrum of different evaluation tasks; furthermore, we're seeing a lot of folks adopt our code for their research into new data mixtures, scaling to different LLM/Vision backbones, new projection mechanisms, and more.
I think it'd be awesome to support these in `transformers` -- especially to tap into existing tooling for loading quantized versions of models, using PEFT and other tools in the HF ecosystem for adaptation/fine-tuning, and general usability of our trained models.
While we have 50+ checkpoints (all open-sourced, and loadable in our library), all current models share a pretty common interface: a pretrained visual extractor from `timm`, an `XXXForCausalLM` from `transformers`, and a lightweight `nn.Module` to project visual features into the LLM embedding space. As such, I'm hoping to contribute a general `modeling_prismatic.py` that implements `PrismaticPretrainedModel` and `PrismaticForConditionalGeneration` and properly instantiates the appropriate VLM instance using the dependencies already in `transformers`.
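For concreteness, here's a rough sketch of that shared interface (the class name, projector design, and backbone handling below are illustrative, not our actual implementation):

```python
import timm
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM


class PrismaticSketch(nn.Module):
    """Illustrative VLM skeleton: timm vision backbone + projector + HF causal LM."""

    def __init__(self, vision_backbone_id: str, llm_backbone_id: str):
        super().__init__()
        # Pretrained visual feature extractor from `timm` (e.g., a CLIP/SigLIP/DINOv2 ViT)
        self.vision_backbone = timm.create_model(vision_backbone_id, pretrained=True, num_classes=0)
        # Any `XXXForCausalLM` from `transformers` serves as the language backbone
        self.llm = AutoModelForCausalLM.from_pretrained(llm_backbone_id)
        # Lightweight projector mapping patch features into the LLM embedding space
        vision_dim = self.vision_backbone.num_features
        llm_dim = self.llm.get_input_embeddings().embedding_dim
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, input_ids, attention_mask=None, **kwargs):
        # Extract patch-level features and project them into the LLM embedding space
        patch_features = self.vision_backbone.forward_features(pixel_values)
        projected_patches = self.projector(patch_features)
        # Prepend projected visual tokens to the text token embeddings
        token_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([projected_patches, token_embeds], dim=1)
        if attention_mask is not None:
            # Extend the attention mask to cover the prepended visual tokens
            vision_mask = attention_mask.new_ones(projected_patches.shape[:2])
            attention_mask = torch.cat([vision_mask, attention_mask], dim=1)
        return self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask, **kwargs)
```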
I'm happy to get started with this, following the instructions here, but would love help/advice on clean ways to support all the various image backbones / LLM backbones / preprocessing schemes, and verifying compatibility with existing HF tooling!
Open source status
- [X] The model implementation is available
- [X] The model weights are available
Provide useful links for the implementation
Prismatic Authors: @siddk, @ashwin-balakrishna96
(Potentially) Relevant Folks at HF: @merveenoyan, @NielsRogge
Useful Links:
I guess it would add support for the HyperGAI/HPT1_5-Air-Llama-3-8B-Instruct-multimodal model at the same time, right?
Yeah - theoretically our models are general enough to support a ton of the new models others are releasing out of the box (including the one linked above, the original LLaVa models, etc.)
Hi @siddk, exciting research!
The easiest, quickest, and recommended way to add models is directly on the Hub: https://huggingface.co/docs/transformers/custom_models. In fact, it's probably the best fit in this case if there are differing implementation details across the different checkpoints.
This means, once working, the model can be found and used immediately without having to go through the PR process. We find this is a lot quicker as the bar for adding code into the library is high due to the maintenance cost of every new model, and so reviews take quite a while.
We provide as much support as we can there -- let us know if anything isn't working or if you need help adding them 🤗
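For reference, the custom-code flow from those docs boils down to something like this (the Prismatic class names, config fields, and registered auto class below are placeholders, not a final design):

```python
from transformers import PretrainedConfig, PreTrainedModel


class PrismaticConfig(PretrainedConfig):
    model_type = "prismatic"

    def __init__(self, vision_backbone_id="siglip-vit-so400m", llm_backbone_id="llama2-7b", **kwargs):
        # Backbone identifiers here are illustrative placeholders
        self.vision_backbone_id = vision_backbone_id
        self.llm_backbone_id = llm_backbone_id
        super().__init__(**kwargs)


class PrismaticForConditionalGeneration(PreTrainedModel):
    config_class = PrismaticConfig

    def __init__(self, config: PrismaticConfig):
        super().__init__(config)
        # ... build the timm vision backbone, projector, and LLM here ...


# Registering lets the `Auto*` classes resolve these when `trust_remote_code=True`;
# `save_pretrained` / `push_to_hub` then ships this code alongside the weights.
PrismaticConfig.register_for_auto_class()
PrismaticForConditionalGeneration.register_for_auto_class("AutoModelForVision2Seq")
```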
Thanks so much @amyeroberts -- just to clarify, I can register these models with the same `XXXForConditionalGeneration` API used by models like LLaVA, right?
Ideally, I'd want things like the HF Trainer, PEFT, and bitsandbytes integration to all work out of the box!
@siddk yep! The only difference in usage compared to e.g. BERT is passing `trust_remote_code=True` in the `from_pretrained` call.
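For example (the repo id, quantization settings, and LoRA target modules here are placeholders):

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Loading a Hub-hosted custom model works like any other checkpoint,
# just with `trust_remote_code=True`; quantization hooks in as usual.
model = AutoModelForVision2Seq.from_pretrained(
    "prismatic-vlms/example-checkpoint",  # placeholder repo id
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
    ),
)

# Standard PEFT/LoRA adaptation on top of the loaded model
peft_model = get_peft_model(model, LoraConfig(r=16, target_modules=["q_proj", "v_proj"]))
peft_model.print_trainable_parameters()
```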