
Blazingly fast LLM inference.

186 mistral.rs issues, sorted by most recently updated

**Describe the bug** If two requests are sent to the server at roughly the same time, it will start to respond to both requests and then crash with the following...

bug

My Mac has an M1 chip. When I execute the following command: `cargo run --release --features mkl -- -i plain -m meta-llama/Meta-Llama-3-8B-Instruct -a llama`, the following error occurs. Does it mean that...

This increases compatibility with OpenAI and llama-cpp-python. I would appreciate any thoughts on this change. **Breaking:** This breaks any code which uses the chat completion API, as it removes...

pyo3
breaking
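
For context on the compatibility angle, here is a minimal sketch of calling an OpenAI-compatible chat completion endpoint with the `openai` Python client. The port, model id, and message contents are placeholders, and this does not reflect the specific fields the PR above removes or changes.

```python
# Minimal sketch: talking to an OpenAI-compatible chat completion endpoint
# with the `openai` Python client. The base_url port ("1234") and model id
# ("mistral") are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="mistral",  # placeholder model id
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
)
print(resp.choices[0].message.content)
```

Because the request and response shapes mirror OpenAI's, existing client code like this should keep working against a compatible server without modification.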

This also updates the loading process to track loading of shards instead of tensors. This will enable loading in Jupyter without being rate-limited and hanging.

Dynamic LoRA swapping, first raised in #259, enables the user to dynamically set active LoRA adapters. This can be configured per-request to enable users to add their own routing functionality....

new feature
backend
models
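
As a sketch of what per-request routing could look like, the snippet below posts a chat completion with a hypothetical `adapters` field. The field name, endpoint, port, model id, and adapter name are all assumptions for illustration, not the actual mistral.rs request schema.

```python
# Hypothetical sketch of per-request LoRA adapter selection. The "adapters"
# field is an assumed name used for illustration; it is not necessarily the
# field mistral.rs exposes.
import requests

payload = {
    "model": "mistral",                              # placeholder model id
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
    "adapters": ["support-summarizer"],              # hypothetical per-request adapter list
}
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",     # assumed local server address
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The idea is that a gateway could set this field differently per user or per task, routing every request to the adapter it needs while sharing one base model.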

A feature allowing swapping LoRA adapters at runtime could reduce the overhead for running multiple specialized model adapters. This style could either facilitate serving different models to individual users (akin...

new feature
models

Love to see more Rust in the AI space. I work on a tool called cargo-dist that can help package up pre-built binaries and build installers so it's easier for...

This PR adds support for our first multimodal model: Idefics 2 (https://huggingface.co/HuggingFaceM4/idefics2-8b)!

**Implementation TODOs:**
- [x] VisionTransformer
- [x] Encoder
- [x] Attention
- [x] MLP
- [x] VisionEmbeddings (pending...

new feature
models
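
For readers unfamiliar with the components in that checklist, here is a generic, illustrative PyTorch sketch of two of them (VisionEmbeddings and MLP). The shapes and hyperparameters are assumptions; this is not the mistral.rs Idefics 2 code.

```python
# Generic sketches of two vision-transformer building blocks named in the
# checklist above. All dimensions are illustrative defaults, not the values
# used by Idefics 2 or mistral.rs.
import torch
import torch.nn as nn


class VisionEmbeddings(nn.Module):
    """Split an image into patches and project each patch to the hidden size."""

    def __init__(self, hidden: int = 768, patch: int = 14, channels: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(channels, hidden, kernel_size=patch, stride=patch)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, hidden, H/patch, W/patch) -> (B, num_patches, hidden)
        return self.proj(pixel_values).flatten(2).transpose(1, 2)


class Mlp(nn.Module):
    """Standard transformer MLP: expand, activate, project back down."""

    def __init__(self, hidden: int = 768, intermediate: int = 3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden, intermediate)
        self.fc2 = nn.Linear(intermediate, hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))
```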

There's some work being done to implement Infini-attention (https://arxiv.org/pdf/2404.07143). In a nutshell, it allows for essentially unlimited context length without incurring the quadratic penalty. There's a proof of...
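
To make the "unlimited context without the quadratic penalty" point concrete, below is a toy NumPy sketch of the compressive memory described in the linked paper: the memory matrix and its normalizer are fixed-size, and each incoming segment first retrieves from them and then updates them. The update and retrieval formulas follow the paper; the dimensions, random data, and loop length are made up for illustration, and this is not the mistral.rs implementation.

```python
# Toy sketch of Infini-attention's compressive memory (arXiv:2404.07143).
# M and z stay the same size no matter how many segments stream through,
# so cost per segment is constant instead of growing with context length.
import numpy as np

def elu1(x):
    # sigma(x) = ELU(x) + 1, the non-negative feature map used by the paper
    return np.where(x > 0, x + 1.0, np.exp(x))

d_k, d_v = 64, 64
M = np.zeros((d_k, d_v))   # compressive memory, fixed size
z = np.zeros((d_k,))       # normalization term, fixed size

rng = np.random.default_rng(0)
for _ in range(1000):                    # stream arbitrarily many segments
    K = rng.normal(size=(128, d_k))      # segment keys
    V = rng.normal(size=(128, d_v))      # segment values
    Q = rng.normal(size=(128, d_k))      # segment queries

    sq, sk = elu1(Q), elu1(K)
    # retrieve what previous segments wrote into memory
    A_mem = (sq @ M) / ((sq @ z) + 1e-6)[:, None]
    # then fold the current segment into memory
    M += sk.T @ V
    z += sk.sum(axis=0)

print(M.shape, z.shape)  # memory footprint is constant regardless of context length
```

In the full mechanism this memory read-out is blended with ordinary local dot-product attention through a learned gate, which is where the per-segment quality comes from; the sketch only shows why the memory side scales linearly.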

**Describe the bug** Quantizing large models via in-situ quantization leads to out-of-memory issues, even though the final quantized version should be able to fit in VRAM. **Latest commit**...

bug