Eric Buehler

@huggingface

Results: 543 comments of Eric Buehler

Server crashes while processing 2 concurrent requests

I was able to reproduce the error by running the following in quick succession.

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{ "model":...
```

Batched & chunked prefill

@lucasavila00, this looks great. It'll require modifying the attention mask calculation of every model, so it may be helpful to factor those out into a `layers.rs` in `mistralrs-core`.
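To illustrate the kind of helper that could be factored out, here is a minimal sketch of a shared causal-mask function that also handles a chunk arriving after already-cached tokens. It is not the actual `mistralrs-core` code: the name `causal_mask`, its parameters, and the use of plain `Vec<Vec<f32>>` instead of Candle tensors are all assumptions made for illustration.

```rust
/// Hypothetical shared helper (not the real mistralrs-core API).
/// Builds an additive causal mask for a chunk of `chunk_len` query tokens
/// that follows `past_len` tokens already held in the KV cache.
/// Entry [i][j] is 0.0 where query i may attend to key j, and -inf otherwise.
pub fn causal_mask(chunk_len: usize, past_len: usize) -> Vec<Vec<f32>> {
    let total = past_len + chunk_len;
    (0..chunk_len)
        .map(|i| {
            (0..total)
                .map(|j| {
                    // Query i sits at absolute position past_len + i and
                    // may see every key at or before that position.
                    if j <= past_len + i {
                        0.0
                    } else {
                        f32::NEG_INFINITY
                    }
                })
                .collect()
        })
        .collect()
}
```

With `past_len = 0` this reduces to the usual lower-triangular mask, so one function of this shape could in principle serve both plain and chunked prefill across models.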

Batched & chunked prefill

@lucasavila00, I am actually going to end up adding this in #242.

Quantized Mistral: Batching is slower than non batches

Yes, I've been tracking that. I have merged the upstream changes now, so it should be faster.

Quantized Mistral: Batching is slower than non batches

Ah, that could be it. Looking forward to the Candle implementation; maybe we can author a PR.

Quantized Mistral: Batching is slower than non batches

Refs huggingface/candle#2075

Quantized Mistral: Batching is slower than non batches

I think the llama.cpp issue described performance regressions beyond a batch size (BS) of 4.

Quantized Mistral: Batching is slower than non batches

I can add the specialized kernels on our branch. Do you think that would be good? I wonder why llama.cpp moved from 8 to 4; 5370 did not specify a...

Quantized Mistral: Batching is slower than non batches

Great, I'll add it tomorrow.

Quantized Mistral: Batching is slower than non batches

Refs https://github.com/huggingface/candle/pull/2077
