mistral.rs
Blazingly fast LLM inference.
I added some code that prints the queue state: https://github.com/EricLBuehler/mistral.rs/pull/138

I ran it on a single generation:

```
2024-04-14T17:34:50.601969Z INFO mistralrs_core::engine: Prompt[] Completion[210] - 21ms
```

And on batches: ```...
Since generation speed is almost matching llama.cpp after https://github.com/EricLBuehler/mistral.rs/pull/152, I think it's worth trying to optimize prompt processing now.
- [ ] RowParallelLinear
- [ ] MergedColumnParallelLinear
- [ ] QKVParallelLinear
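For context on the checklist above, here is a minimal single-device sketch of the row-parallel idea behind `RowParallelLinear`, using Candle tensors; `row_parallel_matmul` and the shard layout are our own illustration, not the repo's API. In a real multi-GPU setup each shard would live on its own device and the final sum would be an all-reduce.

```rust
use candle_core::{Device, Result, Tensor};

// Illustrative only: each "rank" holds a slice of the input features and
// the matching rows of the weight; partial products are summed to form
// the full output (an all-reduce across GPUs in practice).
fn row_parallel_matmul(x_shards: &[Tensor], w_shards: &[Tensor]) -> Result<Tensor> {
    let mut acc: Option<Tensor> = None;
    for (x, w) in x_shards.iter().zip(w_shards) {
        let partial = x.matmul(w)?;
        acc = Some(match acc {
            Some(a) => (a + partial)?,
            None => partial,
        });
    }
    Ok(acc.expect("at least one shard"))
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Full x: 1x4, full w: 4x3; split the shared dimension in two.
    let x = Tensor::randn(0f32, 1f32, (1, 4), &dev)?;
    let w = Tensor::randn(0f32, 1f32, (4, 3), &dev)?;
    let x_shards = [x.narrow(1, 0, 2)?, x.narrow(1, 2, 2)?];
    let w_shards = [w.narrow(0, 0, 2)?, w.narrow(0, 2, 2)?];
    let y = row_parallel_matmul(&x_shards, &w_shards)?;
    assert_eq!(y.dims(), &[1, 3]);
    Ok(())
}
```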
Refs and closes #215.

# API addition

- `DeviceMapper`
  - All at-loading-time methods take a `loading_isq` parameter
  - Add `fn set_nm_device(..., loading_isq: bool) -> VarBuilder`
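A hedged sketch of the shape of that API change follows; the trait body here is illustrative, assuming Candle's `VarBuilder`, and only the `set_nm_device` name and the `loading_isq: bool` flag come from this PR.

```rust
use candle_nn::VarBuilder;

// Sketch of the amended mapper interface: at-loading-time methods take a
// `loading_isq` flag so tensors destined for in-situ quantization can be
// handled differently (e.g. kept on the host) while loading.
trait DeviceMapper {
    fn set_nm_device<'a>(&self, vb: VarBuilder<'a>, loading_isq: bool) -> VarBuilder<'a>;
}

// Trivial illustrative implementation that never remaps.
struct IdentityMapper;

impl DeviceMapper for IdentityMapper {
    fn set_nm_device<'a>(&self, vb: VarBuilder<'a>, _loading_isq: bool) -> VarBuilder<'a> {
        // A real mapper would move `vb` onto the mapped device, or skip
        // the move when `loading_isq` is set; here we pass it through.
        vb
    }
}
```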
Argsort was just added to Candle (https://github.com/huggingface/candle/pull/2132). Using an argsort kernel would accelerate the CPU sorting step of `topk` and `topp` sampling, which currently accounts for a large share of sampling time.
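For instance, a minimal sketch of top-k selection via the new kernel, assuming Candle's `arg_sort_last_dim`; the `top_k_indices` helper is our own illustration, not the sampler's actual code.

```rust
use candle_core::{Device, Result, Tensor, D};

// Pick the k highest-logit token ids with one argsort instead of
// sorting the whole vocabulary on the host.
fn top_k_indices(logits: &Tensor, k: usize) -> Result<Tensor> {
    // asc = false: indices of logits in descending order (u32).
    let sorted_idx = logits.arg_sort_last_dim(false)?;
    sorted_idx.narrow(D::Minus1, 0, k)
}

fn main() -> Result<()> {
    let logits = Tensor::new(&[0.1f32, 2.0, 0.5, 1.5], &Device::Cpu)?;
    let topk = top_k_indices(&logits, 2)?;
    println!("{topk}"); // token ids 1 and 3
    Ok(())
}
```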
Closes https://github.com/EricLBuehler/mistral.rs/issues/235
Continuing https://github.com/EricLBuehler/mistral.rs/pull/219 Closes https://github.com/EricLBuehler/mistral.rs/issues/216
I'm creating this issue to track work on adding async channels to avoid blocking in the server, since https://github.com/EricLBuehler/mistral.rs/pull/233 was reverted.
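For reference, a minimal sketch of the pattern in question using `tokio::sync::mpsc`; the `Request` type and channel capacity are illustrative, not the server's real types.

```rust
use tokio::sync::mpsc;

// Stand-in for the engine's real request type.
struct Request {
    prompt: String,
}

#[tokio::main]
async fn main() {
    // Async channel: `send`/`recv` yield to the runtime instead of
    // blocking an OS thread, so HTTP handlers stay responsive while
    // the engine is busy.
    let (tx, mut rx) = mpsc::channel::<Request>(64);

    // Engine task draining the queue.
    let engine = tokio::spawn(async move {
        while let Some(req) = rx.recv().await {
            println!("processing: {}", req.prompt);
        }
    });

    // A handler enqueues without blocking; it awaits only when full.
    tx.send(Request { prompt: "hello".into() }).await.unwrap();
    drop(tx); // closing all senders lets the engine task finish
    engine.await.unwrap();
}
```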
I found it while testing https://github.com/EricLBuehler/mistral.rs/pull/236.