mistral.rs
Blazingly fast LLM inference.
https://github.com/flame/blis Jeff Diamond, I think from Oracle, did ARM optimizations in the BLIS library. Are there any plans to support it or other BLAS-like libraries? CPU inference on Ampere is...
My installation is based on [c2ff402](https://github.com/EricLBuehler/mistral.rs/commit/c2ff4027824f71d59ebcc6a4aad87f099865a348)
```shell
./mistralrs-server -i plain -m microsoft/Phi-3-small-8k-instruct -a phi3
```
I run into the following error:
```
Could not get file "tokenizer.json" from API: RequestError(Status(404,...
```
This reports mistral.rs as being faster than llama.cpp: https://github.com/EricLBuehler/mistral.rs/discussions/612
But I'm seeing much slower speeds for the same prompt/settings.
Mistral.rs
```
Usage { completion_tokens: 501, prompt_tokens: 28, total_tokens: 529, avg_tok_per_sec: 16.980707,...
```
## Describe the bug
Running in Metal mode on a MacBook M2 Pro is too slow, and it becomes incredibly slow when the question is slightly more complex. Even to the...
What is the current status of providing prebuilt binaries for the Python bindings? If a prebuilt binary were provided, it would be really beneficial in terms of download/compile time of the Python bindings....
## Describe the bug
```bash
cargo run --features metal --package mistralrs-server --bin mistralrs-server -- --token-source cache -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3 --dtype bf16
```
error message
```bash
.4800033569336, 64.51000213623047,...
```
I noticed you guys forked a bunch of controller code from AICI for your constraints. I think you might be interested in https://github.com/microsoft/llguidance - it implements a more general constraint...
## Describe the bug
I'm trying to run mistralrs on a VRAM-constrained system (16 GB VRAM, 64 GB RAM), via the docker image.
```bash
ghcr.io/ericlbuehler/mistral.rs:cuda-80-0.3
```
The arguments for the...
## Describe the bug
When using the mistralrs library to process multiple requests in a loop, the `blocking_recv` call hangs indefinitely after the first iteration. This prevents the code from...
Beam search could be very valuable for non-creative generation.
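For readers unfamiliar with the technique, here is a minimal, generic sketch of beam search over a toy next-token table. It does not describe how mistral.rs samples or how the feature would be implemented there; `beam_search`, `toy_model`, and the probability table are hypothetical names and data made up for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps a token prefix to next-token probabilities.
NextTokenFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(
    next_token_probs: NextTokenFn,
    beam_width: int,
    max_len: int,
    eos: str = "<eos>",
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return up to `beam_width` (sequence, log-probability) pairs, best first."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[Tuple[str, ...], float]] = []
        for seq, score in beams:
            # Finished sequences are carried over unchanged.
            if seq and seq[-1] == eos:
                candidates.append((seq, score))
                continue
            # Expand every live sequence by each possible next token.
            for tok, p in next_token_probs(seq).items():
                if p > 0.0:
                    candidates.append((seq + (tok,), score + math.log(p)))
        # Keep only the highest-scoring `beam_width` candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    # Made-up distribution: the greedy first choice "a" leads to a worse total.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})


greedy = beam_search(toy_model, beam_width=1, max_len=5)
wide = beam_search(toy_model, beam_width=2, max_len=5)
print(greedy[0][0], wide[0][0])
```

In this toy table a beam width of 1 (greedy decoding) commits to `a` first, while a width of 2 keeps `b` alive long enough to find the higher-probability sequence, which is the usual argument for beam search in non-creative generation.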