Eric Buehler

Results: 543 comments by Eric Buehler

@lucasavila00, using `mistralrs-bench`, it is consistent up to BS=8 as expected with the new kernels. I ran the tests on an A10.

```
+------------------------------------+---------+--------+-----------------+--------------+-------------+--------------+
| model                              | backend | test...
```

One thing that may be causing the slowdown is that we sample sequentially, so the sampling step becomes much slower at larger batch sizes.
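To illustrate why sequential sampling scales poorly with batch size, here is a minimal sketch of sequential versus per-sequence-threaded sampling. The `sample_argmax` sampler, the logits shapes, and the scoped-thread layout are illustrative stand-ins, not mistral.rs's actual sampling code:

```rust
use std::thread;

/// Greedy (argmax) sampling over one sequence's logits.
/// Hypothetical stand-in for a real sampler.
fn sample_argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Sequential sampling: wall-clock cost grows linearly with batch size.
fn sample_sequential(batch_logits: &[Vec<f32>]) -> Vec<usize> {
    batch_logits.iter().map(|l| sample_argmax(l)).collect()
}

/// Parallel sampling: one scoped thread per sequence, so per-step
/// latency is roughly that of the slowest single sequence.
fn sample_parallel(batch_logits: &[Vec<f32>]) -> Vec<usize> {
    thread::scope(|s| {
        let handles: Vec<_> = batch_logits
            .iter()
            .map(|l| s.spawn(move || sample_argmax(l)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let batch = vec![vec![0.1, 0.9, 0.0], vec![0.7, 0.2, 0.1]];
    assert_eq!(sample_sequential(&batch), vec![1, 0]);
    assert_eq!(sample_sequential(&batch), sample_parallel(&batch));
}
```

The parallel version only pays off once per-sequence sampling work (e.g. penalties, top-k/top-p filtering) is heavy enough to cover the thread overhead; for trivial argmax, the sequential loop can still win.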

Hi @BHX2, thank you for raising this. I am considering two options for implementing this and wanted your opinion: 1) At startup time, tell `mistral.rs` which adapters can be...

Great, thanks for your feedback. I think I will add the preloading to the ordering file and then expose the activation API in the HTTP request or the Request objects...

@BHX2, @LLukas22: I have a working implementation in #262 (on the `lora_swapping` branch) of LoRA swapping at runtime. Currently, the only missing feature is that there is no way to...

> I think processing multiple different adapters in a single batch is a bit overkill (but it would be nice if the implementation isn't too complicated). For now we could...

We use the adapter model ordering file [here](https://github.com/EricLBuehler/mistral.rs/blob/lora_swapping/docs/ADAPTER_MODELS.md#adapter-ordering-file). It can be used for both LoRA and X-LoRA, so when using LoRA it does not load the classifier. I'll be adding...
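The key property of an ordering file is that it fixes a stable index for every adapter name, so a runtime activation request can be mapped onto positions in a scaling vector. The following is a hedged sketch of that idea only; the struct, field names, and 0/1 mask are hypothetical and do not reflect mistral.rs's actual types or the real schema in `ADAPTER_MODELS.md`:

```rust
/// Hypothetical in-memory model of an adapter ordering: the ordering
/// file assigns each adapter name a fixed position, shared by LoRA and
/// X-LoRA (the classifier is simply not loaded in the LoRA case).
struct AdapterOrdering {
    /// Adapter names; a name's position is its index in the scaling vector.
    order: Vec<String>,
}

impl AdapterOrdering {
    /// Turn a set of adapter names to activate into a 0/1 scaling mask
    /// aligned with the fixed ordering.
    fn activation_mask(&self, active: &[&str]) -> Vec<f32> {
        self.order
            .iter()
            .map(|name| if active.contains(&name.as_str()) { 1.0 } else { 0.0 })
            .collect()
    }
}

fn main() {
    let ordering = AdapterOrdering {
        order: vec!["math".into(), "code".into(), "chat".into()],
    };
    // Activating only "code" scales the other adapters to zero.
    assert_eq!(ordering.activation_mask(&["code"]), vec![0.0, 1.0, 0.0]);
}
```

Keeping the ordering fixed at startup is what makes runtime swapping cheap: activation only changes a small scaling vector, not the loaded weights.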

@lucasavila00, do you think we should also dequantize to F16 for large batch sizes? To my understanding, this is beneficial because the BLAS implementation of matrix-matrix product is faster than our...
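The intuition is that dequantizing the weight matrix is a one-off O(m·k) cost per forward pass, which gets amortized over the batch once the subsequent GEMM is faster per element than the fused quantized kernel. A back-of-the-envelope sketch of that trade-off follows; the cost model, the `gemm_speedup` factor, and the function name are assumptions for illustration, not the heuristic mistral.rs uses:

```rust
/// Decide (under a toy cost model) whether to dequantize to F16 and use
/// a BLAS GEMM, or to stay on the fused quantized matmul kernel.
///
/// Costs are in arbitrary units per output row:
/// - quantized path: 1.0 per row (baseline);
/// - dequant path:   1.0 / gemm_speedup per row, plus one weight-sized
///   dequantization pass (cost 1.0) amortized over the whole batch.
fn prefer_dequant(batch: usize, gemm_speedup: f64) -> bool {
    let quant_cost = batch as f64;
    let dequant_cost = batch as f64 / gemm_speedup + 1.0;
    dequant_cost < quant_cost
}

fn main() {
    // With a hypothetical 2x GEMM speedup: at batch size 1 the
    // dequantization pass is never amortized, but by batch size 8
    // the faster matrix-matrix product wins.
    assert!(!prefer_dequant(1, 2.0));
    assert!(prefer_dequant(8, 2.0));
}
```

This is why such schemes typically switch paths above some batch-size threshold rather than always dequantizing: at BS=1 the workload is a matrix-vector product, where fused quantized kernels are hard to beat.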

@lucasavila00, that sounds great. Please let me know the results!

@lucasavila00, that is very interesting. How did you force the dequantization?