
Add Prismatic VLMs to Transformers

Open siddk opened this issue 9 months ago • 5 comments

Model description

Hi! I'm the author of "Prismatic VLMs", our upcoming ICML paper that introduces and ablates design choices for visually-conditioned language models similar to LLaVa or InstructBLIP, contributing ~50 new VLMs at the 3B/7B/13B scale that are trained with:

  • Different Visual Representations (CLIP, SigLIP, DINOv2, fusions thereof like SigLIP + DINOv2)
  • Different LLM Backbones (LLaMa2, Vicuña v1.5, Mistral v0.1, Mistral v0.1 Instruct, Phi-2, etc.)
  • Different Data (e.g., the LLaVa v1.5 Data, LVIS-Instruct-4V, and more upcoming!)

Our best models outperform LLaVa v1.5 given the same data and scale across a wide spectrum of evaluation tasks; furthermore, we're seeing a lot of folks adopt our code for their research into new data mixtures, scaling to different LLM/vision backbones, new projection mechanisms, and more.

I think it'd be awesome to support these in transformers -- especially to tap into existing tooling for loading quantized versions of models, using PEFT and other tools in the HF ecosystem for adaptation/fine-tuning, and general usability of our trained models.

While we have 50+ checkpoints (all open-sourced and loadable in our library), all current models share a pretty common interface: a pretrained visual feature extractor from timm, an XXXForCausalLM from transformers, and a lightweight nn.Module that projects visual features into the LLM embedding space. As such, I'm hoping to contribute a general modeling_prismatic.py that implements PrismaticPretrainedModel and PrismaticForConditionalGeneration and properly instantiates the appropriate VLM using the dependencies already in transformers.
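Concretely, the composition looks roughly like the sketch below (a minimal, illustrative sketch only: the class layout, backbone ids, and the simplified forward pass that just prepends visual tokens are assumptions, not our final implementation):

```python
import timm
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM


class PrismaticProjector(nn.Module):
    """Lightweight MLP mapping patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)


class PrismaticVLM(nn.Module):
    """Illustrative composition: timm vision backbone + HF causal LM + projector."""

    def __init__(self, vision_backbone_id: str, llm_id: str):
        super().__init__()
        # e.g. a CLIP/SigLIP/DINOv2 ViT from timm (ids here are illustrative)
        self.vision_backbone = timm.create_model(
            vision_backbone_id, pretrained=True, num_classes=0
        )
        self.llm = AutoModelForCausalLM.from_pretrained(llm_id)
        self.projector = PrismaticProjector(
            self.vision_backbone.num_features, self.llm.config.hidden_size
        )

    def forward(self, pixel_values, input_ids, attention_mask=None):
        # Extract patch features and project them into the LLM embedding space.
        patch_features = self.vision_backbone.forward_features(pixel_values)
        visual_embeds = self.projector(patch_features)
        # Simplified fusion: prepend visual tokens to the text embeddings
        # (the real models insert them according to the prompt template).
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        if attention_mask is not None:
            visual_mask = attention_mask.new_ones(visual_embeds.shape[:2])
            attention_mask = torch.cat([visual_mask, attention_mask], dim=1)
        return self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```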


I'm happy to get started with this, following the instructions here, but would love help/advice on clean ways to support all the various image backbones / LLM backbones / preprocessing schemes, and verifying compatibility with existing HF tooling!

Open source status

  • [X] The model implementation is available
  • [X] The model weights are available

Provide useful links for the implementation

Prismatic Authors: @siddk, @ashwin-balakrishna96
(Potentially) Relevant Folks at HF: @merveenoyan @NielsRogge

Useful Links:

siddk avatar May 03 '24 14:05 siddk

I guess it would add support for the HyperGAI/HPT1_5-Air-Llama-3-8B-Instruct-multimodal model at the same time, right?

Extremys avatar May 03 '24 20:05 Extremys

Yeah - theoretically our models are general enough to support a ton of the new models others are releasing out of the box (including the one linked above, the original LLaVa models, etc.).

siddk avatar May 06 '24 12:05 siddk

Hi @siddk, exciting research!

The easiest, quickest, and recommended way to add models is directly on the Hub: https://huggingface.co/docs/transformers/custom_models. In fact, that route is probably best suited to this case if there are different implementation details across the different checkpoints.

This means that, once working, the model can be found and used immediately without having to go through the PR process. We find this is a lot quicker, as the bar for adding code to the library is high due to the maintenance cost of every new model, so reviews can take quite a while.

We provide as much support as we can there - let us know if anything isn't working or if you need help adding them 🤗
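For reference, the skeleton of the custom-code-on-the-Hub route looks roughly like this (a minimal sketch: the class names, config fields, auto-class choice, and repo id are placeholders rather than a prescribed API, and per the linked docs the config and model code need to live in their own .py files so they get uploaded with the checkpoint):

```python
from transformers import PretrainedConfig, PreTrainedModel


class PrismaticConfig(PretrainedConfig):
    model_type = "prismatic"

    def __init__(self, vision_backbone_id="siglip-vit-so400m", llm_id="llama2-7b", **kwargs):
        # Placeholder fields: whatever is needed to rebuild the backbones/projector.
        self.vision_backbone_id = vision_backbone_id
        self.llm_id = llm_id
        super().__init__(**kwargs)


class PrismaticForConditionalGeneration(PreTrainedModel):
    config_class = PrismaticConfig

    def __init__(self, config: PrismaticConfig):
        super().__init__(config)
        # Build the timm backbone, causal LM, and projector from `config` here.


# Tag the classes so their source files are uploaded with the checkpoint and
# resolved by the Auto* classes when users pass trust_remote_code=True.
PrismaticConfig.register_for_auto_class()
PrismaticForConditionalGeneration.register_for_auto_class("AutoModelForVision2Seq")

# Pushing a model then uploads code + config + weights together, e.g.:
# model.push_to_hub("your-org/prismatic-7b")  # placeholder repo id
```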

amyeroberts avatar May 07 '24 20:05 amyeroberts

Thanks so much @amyeroberts - just to clarify, I can register these models with the same XXXForConditionalGeneration API used by models like LLaVA, right?

Ideally, I want things like the HF Trainer, PEFT, and bitsandbytes integrations to all work out of the box!

siddk avatar May 08 '24 11:05 siddk

@siddk yep! The only difference compared to using e.g. BERT is passing trust_remote_code=True in the from_pretrained call.
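For example (placeholder repo id; this assumes the classes were registered for AutoModelForVision2Seq as sketched above, and quantized loading still depends on the custom model code playing nicely with bitsandbytes):

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Standard loading - the only extra flag is trust_remote_code=True.
model = AutoModelForVision2Seq.from_pretrained(
    "your-org/prismatic-7b",  # placeholder repo id
    trust_remote_code=True,
)

# Quantized loading goes through the same call:
model_4bit = AutoModelForVision2Seq.from_pretrained(
    "your-org/prismatic-7b",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
```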

amyeroberts avatar May 08 '24 12:05 amyeroberts