mistral.rs

Add support for Idefics 2

Open • EricLBuehler opened this issue 9 months ago • 1 comment

This PR adds support for our first multimodal model: Idefics 2 (https://huggingface.co/HuggingFaceM4/idefics2-8b)!

Implementation TODOs:

  • [x] VisionTransformer
    • [x] Encoder
      • [x] Attention
      • [x] MLP
    • [x] VisionEmbeddings (pending issue 2185 or a Tensor::bucketize function)
  • [x] Connector
    • [x] MLP
    • [x] PerceiverLayer
  • [x] Model
    • [x] Forward pass
      • [x] Remove padding images
      • [x] Generate the patch/pixel attention mask (pending a Tensor::unfold function)
      • [x] Run vision submodel and connector submodel
        • [x] Allow Mistral to run trained embedding head on any input tokens
        • [x] Inputs merger to inject embeddings correctly
      • Pass input to Mistral model
        • [x] Allow Mistral to take an embeddings vector instead of using the trained embedding head.
  • [x] Image processor analogous to Idefics2ImageProcessor
    • [x] Resizing
    • [x] Rescaling
    • [x] Normalization
    • [x] Padding
      • [x] Generate pixel attention mask for padded images
        • [x] Pass and use in input injection
    • [x] Create pixel values tensors
  • [x] Vision Model Pipeline
    • [x] Add a VisionModel trait similar to NormalModel
    • [x] Add a ModelCategory: vision, text, embedding etc
    • [x] Handle sequence scheduling with image dimensions
    • [x] Abstract input preparation logic
      • [ ] Handle padding to same, resized shape, across batch dimension
  • [x] Add proper handling of chat templates
    • [x] Load preprocessor/processor config JSON files
    • [x] Support configuration of inputs processor via preprocessor
  • [x] Messages API generalization
    • [x] Support OpenAI compatible method of specifying images
    • [x] Update messages to optionally encode type (akin to examples here).
    • [x] Use processor config to abstract the chat template application process
  • [x] HTTP API
    • [x] Handle decoding from base64
    • [x] Support loading from HTTP.
  • [ ] Rust API
  • [ ] Python API
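
The image processor items above (rescaling, normalization, padding, pixel attention mask) and the still-open item on padding to a common shape across the batch can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: the `Image` struct and the function names `rescale_and_normalize`, `pad_with_mask`, and `pad_batch` are hypothetical, and padding with zeros on the bottom and right is an assumption about the intended behavior.

```rust
/// Hypothetical image container in channel-major (CHW) layout.
#[derive(Debug, Clone, PartialEq)]
pub struct Image {
    pub channels: usize,
    pub height: usize,
    pub width: usize,
    /// Length is channels * height * width.
    pub data: Vec<f32>,
}

/// Rescale 8-bit pixels to [0, 1], then normalize each channel as (x - mean) / std.
pub fn rescale_and_normalize(
    raw: &[u8],
    channels: usize,
    height: usize,
    width: usize,
    mean: &[f32],
    std: &[f32],
) -> Image {
    assert_eq!(raw.len(), channels * height * width);
    assert_eq!(mean.len(), channels);
    assert_eq!(std.len(), channels);
    let plane = height * width;
    let data = raw
        .iter()
        .enumerate()
        .map(|(i, &p)| {
            let c = i / plane; // channel index of this element
            (p as f32 / 255.0 - mean[c]) / std[c]
        })
        .collect();
    Image { channels, height, width, data }
}

/// Zero-pad an image on the bottom and right (an assumption) to the target size.
/// Returns the padded image and a height x width mask: 1 = real pixel, 0 = padding.
pub fn pad_with_mask(img: &Image, target_h: usize, target_w: usize) -> (Image, Vec<u8>) {
    assert!(target_h >= img.height && target_w >= img.width);
    let mut data = vec![0.0f32; img.channels * target_h * target_w];
    let mut mask = vec![0u8; target_h * target_w];
    for y in 0..img.height {
        for x in 0..img.width {
            mask[y * target_w + x] = 1;
            for c in 0..img.channels {
                let src = (c * img.height + y) * img.width + x;
                let dst = (c * target_h + y) * target_w + x;
                data[dst] = img.data[src];
            }
        }
    }
    let padded = Image {
        channels: img.channels,
        height: target_h,
        width: target_w,
        data,
    };
    (padded, mask)
}

/// Pad every image in a batch to the largest height and width found in that batch.
pub fn pad_batch(images: &[Image]) -> Vec<(Image, Vec<u8>)> {
    let max_h = images.iter().map(|i| i.height).max().unwrap_or(0);
    let max_w = images.iter().map(|i| i.width).max().unwrap_or(0);
    images
        .iter()
        .map(|i| pad_with_mask(i, max_h, max_w))
        .collect()
}
```

Returning the mask alongside each padded image keeps the two in sync, so the forward pass can drop padded regions when it builds the patch attention mask.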

Other TODOs:

  • [x] Introduce model type enum to reject mixing of text/multimodal models in speculative decoding
    • Perhaps introduce VisionModel akin to NormalModel.
  • [ ] Ergonomic API support (OpenAI compatible on the HTTP side, but hopefully nicer on the Rust/Python side)
  • [ ] Support device mapping
  • [ ] Support ISQ
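
The model type enum for speculative decoding could look something like the sketch below. It is only an illustration of the idea: the variant names, the `CategoryMismatch` error, and `check_speculative_pair` are hypothetical, and the rule that target and draft must share exactly the same category is an assumption.

```rust
use std::fmt;

/// Hypothetical model categories; the variants actually used in mistral.rs may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelCategory {
    Text,
    Vision,
    Embedding,
}

/// Error returned when the target and draft models are of different categories.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CategoryMismatch {
    pub target: ModelCategory,
    pub draft: ModelCategory,
}

impl fmt::Display for CategoryMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "cannot pair a {:?} target model with a {:?} draft model in speculative decoding",
            self.target, self.draft
        )
    }
}

impl std::error::Error for CategoryMismatch {}

/// Accept a target/draft pair only if both models share a category (assumed rule).
pub fn check_speculative_pair(
    target: ModelCategory,
    draft: ModelCategory,
) -> Result<(), CategoryMismatch> {
    if target == draft {
        Ok(())
    } else {
        Err(CategoryMismatch { target, draft })
    }
}
```

Running this check once, when the speculative pipeline is constructed, would reject a mixed pair at load time rather than partway through a request.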

Pending issues:

  • huggingface/candle#2185
  • mokeyish/candle-ext#7
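
For context on the `Tensor::bucketize` dependency noted under VisionEmbeddings, here is a rough std-only sketch of what a bucketize operation computes. It works on plain slices instead of candle tensors, and its `right` flag is modeled on the semantics of PyTorch's `torch.bucketize`; the function signature is hypothetical and not a candle API.

```rust
/// For each value, return the index of the bucket it falls into.
/// `boundaries` must be sorted in ascending order.
///
/// right = false: returns i such that boundaries[i-1] <  v <= boundaries[i]
/// right = true:  returns i such that boundaries[i-1] <= v <  boundaries[i]
pub fn bucketize(values: &[f32], boundaries: &[f32], right: bool) -> Vec<usize> {
    values
        .iter()
        .map(|&v| {
            if right {
                // Count boundaries that are <= v.
                boundaries.partition_point(|&b| b <= v)
            } else {
                // Count boundaries that are strictly < v.
                boundaries.partition_point(|&b| b < v)
            }
        })
        .collect()
}
```

With boundaries `[1.0, 3.0, 5.0]`, a value of `2.0` lands in bucket 1 under either setting, while a value equal to a boundary such as `1.0` lands in bucket 0 with `right = false` and bucket 1 with `right = true`.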

EricLBuehler • May 15 '24 01:05