Add support for Idefics 2
This PR adds support for our first multimodal model: Idefics 2 (https://huggingface.co/HuggingFaceM4/idefics2-8b)!
Implementation TODOs:
- [x] VisionTransformer
  - [x] Encoder
    - [x] Attention
    - [x] MLP
  - [x] VisionEmbeddings (pending issue 2185 or a `Tensor::bucketize` function)
- [x] Connector
  - [x] MLP
  - [x] PerceiverLayer
- [x] Model
  - [x] Forward pass
    - [x] Remove padding images
    - [x] Generate the patch/pixel attention mask (pending a `Tensor::unfold` function)
    - [x] Run vision submodel and connector submodel
    - [x] Allow `Mistral` to run the trained embedding head on any input tokens
    - [x] Inputs merger to inject embeddings correctly
      - Pass input to the `Mistral` model
    - [x] Allow `Mistral` to take an embeddings vector instead of using the trained embedding head.
- [x] Image processor analogous to `Idefics2ImageProcessor`
  - [x] Resizing
  - [x] Rescaling
  - [x] Normalization
  - [x] Padding
    - [x] Generate pixel attention mask for padded images
      - [x] Pass and use in input injection
  - [x] Create pixel values tensors
- [x] Vision Model Pipeline
  - [x] Add a `VisionModel` trait similar to `NormalModel` (see the first sketch after this list)
  - [x] Add a `ModelCategory`: vision, text, embedding, etc.
  - [x] Handle sequence scheduling with image dimensions
    - [x] Abstract input preparation logic
    - [ ] Handle padding to the same, resized shape across the batch dimension
  - [x] Add proper handling of chat templates
    - [x] Load preprocessor/processor config JSON files
    - [x] Support configuring the inputs processor via the preprocessor config
- [x] Messages API generalization
  - [x] Support the OpenAI-compatible method of specifying images (see the second sketch after this list)
    - [x] Update messages to optionally encode a content type (akin to the examples here)
  - [x] Use the processor config to abstract the chat template application process
- [x] HTTP API
  - [x] Handle decoding images from base64
  - [x] Support loading images over HTTP
- [ ] Rust API
- [ ] Python API
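
For the Vision Model Pipeline items above, here is a minimal sketch of what the `ModelCategory` enum and a `VisionModel` trait analogous to `NormalModel` could look like. The method set and signatures are illustrative assumptions, not the final API:

```rust
// Hypothetical sketch: names and signatures are illustrative assumptions.
use candle_core::{Device, Result, Tensor};

/// Broad category of a loaded model, useful for rejecting invalid pairings
/// (e.g. mixing text-only and vision models in speculative decoding).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ModelCategory {
    Text,
    Vision,
    Embedding,
}

/// Trait for vision+text models, analogous to a text-only `NormalModel` trait.
pub trait VisionModel {
    /// `pixel_values` and `pixel_attention_mask` are `None` for text-only
    /// requests, so the same model can serve both kinds of input.
    fn forward(
        &mut self,
        input_ids: &Tensor,
        pixel_values: Option<&Tensor>,
        pixel_attention_mask: Option<&Tensor>,
        seqlen_offsets: &[usize],
    ) -> Result<Tensor>;

    fn device(&self) -> &Device;
    fn max_seq_len(&self) -> usize;

    fn category(&self) -> ModelCategory {
        ModelCategory::Vision
    }
}
```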
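
For the Messages API and HTTP API items, this is the OpenAI-compatible shape the message `content` is expected to take, with an image given either as an `http(s)` URL or a base64 `data:` URL. The Rust types below are a sketch for illustration only; field and variant names are assumptions:

```rust
// Hypothetical sketch: illustrative request types, not the crate's actual ones.
use serde::Deserialize;

/// A message's `content` is either a plain string or a list of typed parts.
#[derive(Deserialize)]
#[serde(untagged)]
pub enum MessageContent {
    Plain(String),
    Parts(Vec<ContentPart>),
}

/// OpenAI-style parts: `{"type": "text", ...}` or `{"type": "image_url", ...}`.
#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum ContentPart {
    Text { text: String },
    ImageUrl { image_url: ImageUrl },
}

#[derive(Deserialize)]
pub struct ImageUrl {
    /// Either `https://...` (fetched over HTTP) or
    /// `data:image/png;base64,...` (decoded from base64).
    pub url: String,
}
```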
Other TODOs:
- [x] Introduce a model type enum to reject mixing of text/multimodal models in speculative decoding (see the sketch below)
  - Perhaps introduce `VisionModel` akin to `NormalModel`.
- [ ] Ergonomic API support (OpenAI compatible on the HTTP side, but hopefully nicer on the Rust/Python side)
- [ ] Support device mapping
- [ ] Support ISQ
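
The category check for speculative decoding in the first item above could be as simple as comparing the draft and target categories before building the pipeline. Again a sketch only, reusing the `ModelCategory` enum sketched earlier and assuming `anyhow` for errors:

```rust
// Hypothetical sketch: reject a draft/target pair whose categories differ
// before constructing a speculative-decoding pipeline.
fn check_speculative_compat(target: ModelCategory, draft: ModelCategory) -> anyhow::Result<()> {
    if target != draft {
        anyhow::bail!(
            "speculative decoding requires matching model categories, \
             got target={target:?} and draft={draft:?}"
        );
    }
    Ok(())
}
```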
Pending issues:
- huggingface/candle#2185
- mokeyish/candle-ext#7