Eric Buehler
I was able to reproduce the error by running the following in quick succession. ```bash curl http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer EMPTY" \ -d '{ "model":...
@lucasavila00, this looks great. It'll require modifying the attention mask calculation of every model, so it may be helpful to factor those out into a `layers.rs` in `mistralrs-core`.
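As a rough illustration of the factoring suggested above, here is a minimal sketch of a shared causal-mask helper. It uses only the Rust standard library and a flat `Vec<f32>` rather than Candle tensors; the name `causal_mask`, its signature, and the `past_len` parameter are hypothetical and are not taken from `mistralrs-core` or Candle.

```rust
/// Hypothetical shared helper (not the mistral.rs or Candle API): build a
/// causal attention mask as a flat, row-major `Vec<f32>` with shape
/// (seq_len, past_len + seq_len).
///
/// Entry (i, j) is 0.0 when query position i may attend to key position j,
/// and negative infinity otherwise, so the mask can be added to the
/// attention scores before the softmax.
pub fn causal_mask(seq_len: usize, past_len: usize) -> Vec<f32> {
    let total = past_len + seq_len;
    let mut mask = Vec::with_capacity(seq_len * total);
    for i in 0..seq_len {
        for j in 0..total {
            // Query i sits at absolute position past_len + i, so it may see
            // every cached key and every new key up to and including itself.
            let visible = j <= past_len + i;
            mask.push(if visible { 0.0 } else { f32::NEG_INFINITY });
        }
    }
    mask
}
```

Each model could then call a single helper like this instead of carrying its own copy of the mask logic, which would keep a change to the mask calculation in one place.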
@lucasavila00, I am actually going to end up adding this in #242.
Yes, I've been tracking that. I have merged the upstream changes now, so it should be faster.
Ah, that could be it. Looking forward to the Candle implementation; maybe we can author a PR.
Refs huggingface/candle#2075
I think the llama.cpp issue described performance regressions after BS=4.
I can add the specialized kernels on our branch. Do you think that would be good? I wonder why llama.cpp moved from 8 to 4; 5370 did not specify a...
Great, I'll add it tomorrow.
Refs https://github.com/huggingface/candle/pull/2077