Add support for Idefics 2
This PR adds support for our first multimodal model: Idefics 2 (https://huggingface.co/HuggingFaceM4/idefics2-8b)!
Implementation TODOs:
- [x] VisionTransformer
  - [x] Encoder
    - [x] Attention
    - [x] MLP
  - [x] VisionEmbeddings (pending issue 2185 or a `Tensor::bucketize` function)
- [x] Connector
  - [x] MLP
  - [x] PerceiverLayer
- [x] Model
  - [x] Forward pass
    - [x] Remove padding images
    - [x] Generate the patch/pixel attention mask (pending a `Tensor::unfold` function)
    - [x] Run vision submodel and connector submodel
    - [x] Allow `Mistral` to run the trained embedding head on any input tokens
    - [x] Inputs merger to inject embeddings correctly
      - Pass input to the `Mistral` model
    - [x] Allow `Mistral` to take an embeddings vector instead of using the trained embedding head.
- [x] Image processor analogous to `Idefics2ImageProcessor`
  - [x] Resizing
  - [x] Rescaling
  - [x] Normalization
  - [x] Padding
    - [x] Generate pixel attention mask for padded images
      - [x] Pass and use in input injection
  - [x] Create pixel values tensors
- [x] Vision Model Pipeline
  - [x] Add a `VisionModel` trait similar to `NormalModel` (see the first sketch after this list)
  - [x] Add a `ModelCategory`: vision, text, embedding, etc.
  - [x] Handle sequence scheduling with image dimensions
    - [x] Abstract input preparation logic
    - [ ] Handle padding to the same, resized shape across the batch dimension
  - [x] Add proper handling of chat templates
    - [x] Load preprocessor/processor config JSON files
    - [x] Support configuring the inputs processor via the preprocessor config
- [x] Messages API generalization
  - [x] Support the OpenAI-compatible method of specifying images (see the second sketch after this list)
    - [x] Update messages to optionally encode a content type (akin to the examples here)
  - [x] Use the processor config to abstract the chat template application process
- [x] HTTP API
  - [x] Handle decoding images from base64
  - [x] Support loading images over HTTP
- [ ] Rust API
- [ ] Python API
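
For the Vision Model Pipeline items above, here is a minimal sketch of what the `ModelCategory` enum and a `VisionModel` trait analogous to `NormalModel` could look like. The method set and signatures are illustrative assumptions, not the final API:

```rust
// Hypothetical sketch: names and signatures are illustrative assumptions.
use candle_core::{Device, Result, Tensor};

/// Broad category of a loaded model, useful for rejecting invalid pairings
/// (e.g. mixing text-only and vision models in speculative decoding).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ModelCategory {
    Text,
    Vision,
    Embedding,
}

/// Trait for vision+text models, analogous to a text-only `NormalModel` trait.
pub trait VisionModel {
    /// `pixel_values` and `pixel_attention_mask` are `None` for text-only
    /// requests, so the same model can serve both kinds of input.
    fn forward(
        &mut self,
        input_ids: &Tensor,
        pixel_values: Option<&Tensor>,
        pixel_attention_mask: Option<&Tensor>,
        seqlen_offsets: &[usize],
    ) -> Result<Tensor>;

    fn device(&self) -> &Device;
    fn max_seq_len(&self) -> usize;

    fn category(&self) -> ModelCategory {
        ModelCategory::Vision
    }
}
```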
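
For the Messages API and HTTP API items, this is the OpenAI-compatible shape the message `content` is expected to take, with an image given either as an `http(s)` URL or a base64 `data:` URL. The Rust types below are a sketch for illustration only; field and variant names are assumptions:

```rust
// Hypothetical sketch: illustrative request types, not the crate's actual ones.
use serde::Deserialize;

/// A message's `content` is either a plain string or a list of typed parts.
#[derive(Deserialize)]
#[serde(untagged)]
pub enum MessageContent {
    Plain(String),
    Parts(Vec<ContentPart>),
}

/// OpenAI-style parts: `{"type": "text", ...}` or `{"type": "image_url", ...}`.
#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ContentPart {
    Text { text: String },
    ImageUrl { image_url: ImageUrl },
}

#[derive(Deserialize)]
pub struct ImageUrl {
    /// Either `https://...` (fetched over HTTP) or
    /// `data:image/png;base64,...` (decoded from base64).
    pub url: String,
}
```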
Other TODOs:
- [x] Introduce a model type enum to reject mixing of text/multimodal models in speculative decoding (see the sketch below)
  - Perhaps introduce `VisionModel` akin to `NormalModel`.
- [ ] Ergonomic API support (OpenAI compatible on the HTTP side, but hopefully nicer on the Rust/Python side)
- [ ] Support device mapping
- [ ] Support ISQ
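
The category check for speculative decoding in the first item above could be as simple as comparing the draft and target categories before building the pipeline. Again a sketch only, reusing the `ModelCategory` enum sketched earlier and assuming `anyhow` for errors:

```rust
// Hypothetical sketch: reject a draft/target pair whose categories differ
// before constructing a speculative-decoding pipeline.
fn check_speculative_compat(target: ModelCategory, draft: ModelCategory) -> anyhow::Result<()> {
    if target != draft {
        anyhow::bail!(
            "speculative decoding requires matching model categories, \
             got target={target:?} and draft={draft:?}"
        );
    }
    Ok(())
}
```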
Pending issues:
- huggingface/candle#2185
- mokeyish/candle-ext#7