Eric Buehler

Results: 543 comments by Eric Buehler

@lucasavila00, using `mistralrs-bench`, it is consistent up to BS=8 as expected with the new kernels. I ran the tests on an A10.

```
+------------------------------------+---------+--------+-----------------+--------------+-------------+--------------+
| model                              | backend | test...
```

One thing that may be causing the slowdown is that we sample sequentially, so the sampling step becomes much slower at larger batch sizes.
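To illustrate why sequential sampling scales poorly with batch size, here is a minimal sketch of sequential versus per-sequence-threaded sampling. The `sample_argmax` sampler, the logits shapes, and the scoped-thread layout are illustrative stand-ins, not mistral.rs's actual sampling code:

```rust
use std::thread;

/// Greedy (argmax) sampling over one sequence's logits.
/// Hypothetical stand-in for a real sampler.
fn sample_argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Sequential sampling: wall-clock cost grows linearly with batch size.
fn sample_sequential(batch_logits: &[Vec<f32>]) -> Vec<usize> {
    batch_logits.iter().map(|l| sample_argmax(l)).collect()
}

/// Parallel sampling: one scoped thread per sequence, so per-step
/// latency is roughly that of the slowest single sequence.
fn sample_parallel(batch_logits: &[Vec<f32>]) -> Vec<usize> {
    thread::scope(|s| {
        let handles: Vec<_> = batch_logits
            .iter()
            .map(|l| s.spawn(move || sample_argmax(l)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let batch = vec![vec![0.1, 0.9, 0.0], vec![0.7, 0.2, 0.1]];
    assert_eq!(sample_sequential(&batch), vec![1, 0]);
    assert_eq!(sample_sequential(&batch), sample_parallel(&batch));
}
```

The parallel version only pays off once per-sequence sampling work (e.g. penalties, top-k/top-p filtering) is heavy enough to cover the thread overhead; for trivial argmax, the sequential loop can still win.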

Hi @BHX2, thank you for raising this. I am considering two options for implementing this and wanted your opinion: 1) At startup time, tell `mistral.rs` which adapters can be...

Great, thanks for your feedback. I think I will add the preloading to the ordering file and then expose the activation API in the HTTP request or the Request objects...

@BHX2, @LLukas22: I have a working implementation in #262 (on the `lora_swapping` branch) of LoRA swapping at runtime. Currently, the only missing feature is that there is no way to...

> I think processing multiple different adapters in a single batch is a bit overkill (but it would be nice if the implementation isn't too complicated). For now we could...

We use the adapter model ordering file [here](https://github.com/EricLBuehler/mistral.rs/blob/lora_swapping/docs/ADAPTER_MODELS.md#adapter-ordering-file). It can be used for both LoRA and X-LoRA, so when using LoRA it does not load the classifier. I'll be adding...
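The key property of an ordering file is that it fixes a stable index for every adapter name, so a runtime activation request can be mapped onto positions in a scaling vector. The following is a hedged sketch of that idea only; the struct, field names, and 0/1 mask are hypothetical and do not reflect mistral.rs's actual types or the real schema in `ADAPTER_MODELS.md`:

```rust
/// Hypothetical in-memory model of an adapter ordering: the ordering
/// file assigns each adapter name a fixed position, shared by LoRA and
/// X-LoRA (the classifier is simply not loaded in the LoRA case).
struct AdapterOrdering {
    /// Adapter names; a name's position is its index in the scaling vector.
    order: Vec<String>,
}

impl AdapterOrdering {
    /// Turn a set of adapter names to activate into a 0/1 scaling mask
    /// aligned with the fixed ordering.
    fn activation_mask(&self, active: &[&str]) -> Vec<f32> {
        self.order
            .iter()
            .map(|name| if active.contains(&name.as_str()) { 1.0 } else { 0.0 })
            .collect()
    }
}

fn main() {
    let ordering = AdapterOrdering {
        order: vec!["math".into(), "code".into(), "chat".into()],
    };
    // Activating only "code" scales the other adapters to zero.
    assert_eq!(ordering.activation_mask(&["code"]), vec![0.0, 1.0, 0.0]);
}
```

Keeping the ordering fixed at startup is what makes runtime swapping cheap: activation only changes a small scaling vector, not the loaded weights.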

@lucasavila00, do you think we should also dequantize to F16 for large batch sizes? To my understanding, this is beneficial because the BLAS implementation of matrix-matrix product is faster than our...
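The intuition is that dequantizing the weight matrix is a one-off O(m·k) cost per forward pass, which gets amortized over the batch once the subsequent GEMM is faster per element than the fused quantized kernel. A back-of-the-envelope sketch of that trade-off follows; the cost model, the `gemm_speedup` factor, and the function name are assumptions for illustration, not the heuristic mistral.rs uses:

```rust
/// Decide (under a toy cost model) whether to dequantize to F16 and use
/// a BLAS GEMM, or to stay on the fused quantized matmul kernel.
///
/// Costs are in arbitrary units per output row:
/// - quantized path: 1.0 per row (baseline);
/// - dequant path:   1.0 / gemm_speedup per row, plus one weight-sized
///   dequantization pass (cost 1.0) amortized over the whole batch.
fn prefer_dequant(batch: usize, gemm_speedup: f64) -> bool {
    let quant_cost = batch as f64;
    let dequant_cost = batch as f64 / gemm_speedup + 1.0;
    dequant_cost < quant_cost
}

fn main() {
    // With a hypothetical 2x GEMM speedup: at batch size 1 the
    // dequantization pass is never amortized, but by batch size 8
    // the faster matrix-matrix product wins.
    assert!(!prefer_dequant(1, 2.0));
    assert!(prefer_dequant(8, 2.0));
}
```

This is why such schemes typically switch paths above some batch-size threshold rather than always dequantizing: at BS=1 the workload is a matrix-vector product, where fused quantized kernels are hard to beat.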

@lucasavila00, that sounds great. Please let me know the results!

@lucasavila00, that is very interesting. How did you force the dequantization?