mistral.rs
Blazingly fast LLM inference.
**Describe the bug** If two requests are sent to the server at roughly the same time, it will start to respond to both requests and then crash with the following...
My Mac has an M1 chip. When I execute the following command: `cargo run --release --features mkl -- -i plain -m meta-llama/Meta-Llama-3-8B-Instruct -a llama`, the following error occurs. Does it mean that...
This increases compatibility with OpenAI and llama-cpp-python. I would appreciate any thoughts on this change. # Breaking This breaks any code which uses the chat completion API as it removes...
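For context, an OpenAI-style chat completion request (the shape that llama-cpp-python and other OpenAI-compatible clients send) looks roughly like the fragment below; field names follow the public OpenAI API, and the exact fields changed by this PR are truncated above:

```json
{
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "stream": false
}
```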
This also updates the loading process to track loading of shards instead of tensors. This will enable loading in Jupyter without being rate limited and hanging.
Dynamic LoRA swapping, first raised in #259, enables the user to dynamically set active LoRA adapters. This can be configured per-request to enable users to add their own routing functionality....
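The per-request routing idea could be sketched as follows. This is a minimal illustration, not the mistral.rs API; the routing table and adapter names are hypothetical stand-ins for whatever mapping a user builds on top of per-request adapter activation:

```python
# Hypothetical routing table: map each user to the LoRA adapters that
# should be active when serving that user's requests.
USER_ADAPTERS = {
    "alice": ["sql-adapter"],
    "bob": ["code-adapter", "docs-adapter"],
}

def adapters_for_request(user, default=("base",)):
    """Pick the adapter set to activate for this request.

    Unknown users fall back to the default adapter list, so every
    request resolves to some concrete set of active adapters.
    """
    return USER_ADAPTERS.get(user, list(default))
```

The point of the sketch is that routing stays entirely in user code: the server only needs to accept a list of adapter names per request.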
A feature allowing swapping LoRA adapters at runtime could reduce the overhead for running multiple specialized model adapters. This style could either facilitate serving different models to individual users (akin...
Love to see more Rust in the AI space. I work on a tool called cargo-dist that can help package up pre-built binaries and build installers, so it's easier for...
This PR adds support for our first multimodal model: Idefics 2 (https://huggingface.co/HuggingFaceM4/idefics2-8b)! **Implementation TODOs:** - [x] VisionTransformer - [x] Encoder - [x] Attention - [x] MLP - [x] VisionEmbeddings (pending...
There's some work being done to implement Infini-attention from https://arxiv.org/pdf/2404.07143. In a nutshell, it allows for essentially unlimited context length without incurring the quadratic penalty. There's a proof of...
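In the paper's formulation (sketched from https://arxiv.org/pdf/2404.07143; the notation follows the paper, not this codebase), each segment updates a compressive memory and retrieves from it in time linear in the segment length, where \(\sigma\) is the ELU + 1 nonlinearity:

```latex
M_s = M_{s-1} + \sigma(K)^{\top} V, \qquad
z_s = z_{s-1} + \sum_{t} \sigma(K_t), \qquad
A_{\mathrm{mem}} = \frac{\sigma(Q)\, M_{s-1}}{\sigma(Q)\, z_{s-1}}
```

Because the memory \(M_s\) has fixed size regardless of how many segments have been consumed, the context window can grow without the usual quadratic attention cost.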
**Describe the bug** Quantizing large models via in-situ quantization leads to out-of-memory issues, even though the final quantized version should fit in VRAM. **Latest commit**...
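A minimal sketch of why peak memory matters here, under the assumption that the OOM comes from holding full-precision and quantized copies of weights alive at the same time (the function and tensor layout are hypothetical, not the mistral.rs implementation): quantizing tensor by tensor and freeing each full-precision original as you go keeps the peak near one tensor's worth of overhead instead of a whole extra model.

```python
import numpy as np

def quantize_in_place(weights):
    """Quantize a dict of float32 tensors to int8, one tensor at a time.

    Each fp32 tensor is popped (and thus freed) before the next one is
    processed, so the fp32 and int8 copies of the full model never
    coexist in memory.
    """
    quantized = {}
    for name in list(weights):
        w = weights.pop(name)                     # drop the fp32 copy as we go
        scale = float(np.abs(w).max()) / 127 or 1.0  # guard all-zero tensors
        quantized[name] = (np.round(w / scale).astype(np.int8), scale)
    return quantized
```

Usage: `quantize_in_place({"layer": np.ones((4, 4), dtype=np.float32)})` returns int8 tensors plus per-tensor scales, and the input dict ends up empty.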