mistral.rs
Blazingly fast LLM inference.
I added some code that prints the queue state: https://github.com/EricLBuehler/mistral.rs/pull/138

I ran it on a single generation:

```
2024-04-14T17:34:50.601969Z INFO mistralrs_core::engine: Prompt[] Completion[210] - 21ms
```

And on batches: ```...
Since generation speed is almost matching llama.cpp after https://github.com/EricLBuehler/mistral.rs/pull/152, I think it's worth trying to optimize prompt processing now.
- [ ] RowParallelLinear
- [ ] MergedColumnParallelLinear
- [ ] QKVParallelLinear
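For context on the checklist above, here is a minimal single-device sketch of the row-parallel idea behind `RowParallelLinear`, using Candle tensors; `row_parallel_matmul` and the shard layout are our own illustration, not the repo's API. In a real multi-GPU setup each shard would live on its own device and the final sum would be an all-reduce.

```rust
use candle_core::{Device, Result, Tensor};

// Illustrative only: each "rank" holds a slice of the input features and
// the matching rows of the weight; partial products are summed to form
// the full output (an all-reduce across GPUs in practice).
fn row_parallel_matmul(x_shards: &[Tensor], w_shards: &[Tensor]) -> Result<Tensor> {
    let mut acc: Option<Tensor> = None;
    for (x, w) in x_shards.iter().zip(w_shards) {
        let partial = x.matmul(w)?;
        acc = Some(match acc {
            Some(a) => (a + partial)?,
            None => partial,
        });
    }
    Ok(acc.expect("at least one shard"))
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Full x: 1x4, full w: 4x3; split the shared dimension in two.
    let x = Tensor::randn(0f32, 1f32, (1, 4), &dev)?;
    let w = Tensor::randn(0f32, 1f32, (4, 3), &dev)?;
    let x_shards = [x.narrow(1, 0, 2)?, x.narrow(1, 2, 2)?];
    let w_shards = [w.narrow(0, 0, 2)?, w.narrow(0, 2, 2)?];
    let y = row_parallel_matmul(&x_shards, &w_shards)?;
    assert_eq!(y.dims(), &[1, 3]);
    Ok(())
}
```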
Refs and closes #215.

# API addition

- `DeviceMapper`
  - All at-loading-time methods take a `loading_isq` parameter
  - Add `fn set_nm_device(..., loading_isq: bool) -> VarBuilder`
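A hedged sketch of the shape of that API change follows; the trait body here is illustrative, assuming Candle's `VarBuilder`, and only the `set_nm_device` name and the `loading_isq: bool` flag come from this PR.

```rust
use candle_nn::VarBuilder;

// Sketch of the amended mapper interface: at-loading-time methods take a
// `loading_isq` flag so tensors destined for in-situ quantization can be
// handled differently (e.g. kept on the host) while loading.
trait DeviceMapper {
    fn set_nm_device<'a>(&self, vb: VarBuilder<'a>, loading_isq: bool) -> VarBuilder<'a>;
}

// Trivial illustrative implementation that never remaps.
struct IdentityMapper;

impl DeviceMapper for IdentityMapper {
    fn set_nm_device<'a>(&self, vb: VarBuilder<'a>, _loading_isq: bool) -> VarBuilder<'a> {
        // A real mapper would move `vb` onto the mapped device, or skip
        // the move when `loading_isq` is set; here we pass it through.
        vb
    }
}
```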
Argsort was just added to Candle (https://github.com/huggingface/candle/pull/2132). Using an argsort kernel would accelerate the CPU sorting step of `topk` and `topp` sampling, which currently accounts for a large share of sampling time.
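For instance, a minimal sketch of top-k selection via the new kernel, assuming Candle's `arg_sort_last_dim`; the `top_k_indices` helper is our own illustration, not the sampler's actual code.

```rust
use candle_core::{Device, Result, Tensor, D};

// Pick the k highest-logit token ids with one argsort instead of
// sorting the whole vocabulary on the host.
fn top_k_indices(logits: &Tensor, k: usize) -> Result<Tensor> {
    // asc = false: indices of logits in descending order (u32).
    let sorted_idx = logits.arg_sort_last_dim(false)?;
    sorted_idx.narrow(D::Minus1, 0, k)
}

fn main() -> Result<()> {
    let logits = Tensor::new(&[0.1f32, 2.0, 0.5, 1.5], &Device::Cpu)?;
    let topk = top_k_indices(&logits, 2)?;
    println!("{topk}"); // token ids 1 and 3
    Ok(())
}
```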
Closes https://github.com/EricLBuehler/mistral.rs/issues/235
Continuing https://github.com/EricLBuehler/mistral.rs/pull/219 Closes https://github.com/EricLBuehler/mistral.rs/issues/216
I'm creating this issue to track work on adding async channels to avoid blocking in the server, since https://github.com/EricLBuehler/mistral.rs/pull/233 was reverted.
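For reference, a minimal sketch of the pattern in question using `tokio::sync::mpsc`; the `Request` type and channel capacity are illustrative, not the server's real types.

```rust
use tokio::sync::mpsc;

// Stand-in for the engine's real request type.
struct Request {
    prompt: String,
}

#[tokio::main]
async fn main() {
    // Async channel: `send`/`recv` yield to the runtime instead of
    // blocking an OS thread, so HTTP handlers stay responsive while
    // the engine is busy.
    let (tx, mut rx) = mpsc::channel::<Request>(64);

    // Engine task draining the queue.
    let engine = tokio::spawn(async move {
        while let Some(req) = rx.recv().await {
            println!("processing: {}", req.prompt);
        }
    });

    // A handler enqueues without blocking; it awaits only when full.
    tx.send(Request { prompt: "hello".into() }).await.unwrap();
    drop(tx); // closing all senders lets the engine task finish
    engine.await.unwrap();
}
```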
I found it while testing https://github.com/EricLBuehler/mistral.rs/pull/236.