mistral.rs
Blazingly fast LLM inference.
This output is from a profiling run:

```
$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T05:58:00.751771Z INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T05:58:00.751790Z INFO mistralrs_bench:...
```
Now that we sample fully on the CPU, we should not merge the sampling timings into the completion timings. This will likely show an improvement on `mistralrs-bench`'s tg test. Notice `llama-bench` selects...
Speculative decoding: https://arxiv.org/pdf/2211.17192

This will refactor the pipeline structure to make the sampling process more abstracted. It will also abstract the scheduling and KV cache management.

# Restriction
-...
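As a rough illustration of the speculative-decoding idea referenced above (draft several tokens with a cheap model, then keep only the prefix the expensive model verifies), here is a toy sketch. The `draft_model` and `target_model` functions are hypothetical stand-ins, not part of mistral.rs; real implementations accept rejected tokens probabilistically per the paper rather than with a hard check.

```python
# Toy stand-in models (hypothetical): the draft model is cheap but
# sometimes wrong; the target model defines the "correct" next token.
def draft_model(prefix):
    # Cheap proposal: a fixed lookup table keyed on the last token.
    table = {0: 1, 1: 2, 2: 0, 3: 0}
    return table[prefix[-1] % 4]

def target_model(prefix, token):
    # Expensive verification: would the target model emit `token` here?
    return token == (prefix[-1] + 1) % 4

def speculative_decode(prefix, gamma=4):
    """Draft `gamma` tokens cheaply, then keep the longest verified run."""
    # Drafting phase: autoregressively propose gamma tokens.
    draft, ctx = [], list(prefix)
    for _ in range(gamma):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # Verification phase: accept tokens until the first disagreement.
    accepted, ctx = [], list(prefix)
    for t in draft:
        if not target_model(ctx, t):
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

print(speculative_decode([0]))  # first two drafts verify, third is rejected
```

The payoff in a real system is that one target-model forward pass can validate several drafted tokens at once, which is why the pipeline's sampling and KV-cache handling need to be abstracted as the issue describes.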
Also enable logging for pyo3 bindings.
**Describe the bug**

Running a docker build seems to fail with the error `failed to read /mistralrs/mistralrs-bench/Cargo.toml`:

```
[+] Building 2.0s (18/20)  docker:default
 => CACHED [mistralrs internal] load git source...
```
**Describe the bug**

This affects models which use sliding window attention, but only when the sequence length is long enough (seq_len > sliding_window) to need the sliding window. This will...
Fixes #247

Since we now depend on `pyo3` in `core`, we need to include `libpython` in our runtime container. Maybe we could put this `pyo3` dependency behind a feature flag...
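One way to realize the feature-flag idea floated above is Cargo's optional-dependency mechanism. This is a hypothetical sketch, not the crate's actual manifest; the feature name and version are illustrative.

```toml
# Hypothetical Cargo.toml fragment: gate pyo3 behind a feature so the
# runtime container only needs libpython when Python bindings are built.
[dependencies]
pyo3 = { version = "0.21", optional = true }

[features]
python-bindings = ["dep:pyo3"]
```

With this, `cargo build` without `--features python-bindings` would skip the `pyo3` dependency entirely, avoiding the `libpython` requirement in the default container image.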