Lukas Kreussel
@EricLBuehler Thanks for adding this. My main use case for `mistral.rs` is using it as an async server alternative to `ollama`, and I can only provide my opinions on the server...
I'll look into it. `mistralrs-core` now also seems to depend on `pyo3`, so I also have to add Python to the builder containers.
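For reference, a minimal sketch of what the builder stage could look like; the base image, paths, and package names are assumptions on my side, not the project's actual Dockerfiles:

```dockerfile
# Illustrative builder stage only; the real Dockerfiles differ.
FROM rust:1-slim AS builder

# pyo3 needs a Python interpreter and headers at build time
# (assumption: python3-dev is enough for pyo3 to locate Python).
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . .
RUN cargo build --release
```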
@sammcj Yeah, the default entry point currently only sets the port and `hf_token`. Since there are a lot of options for loading a model into the server, the containers expect...
Correct me if I'm wrong, but from a quick look at the paper, LIMA seems to be a different fine-tuning approach which doesn't modify the underlying model architecture. If...
Maybe we could change the callback to work with the actual tokens instead of the decoded string; that should make detecting the correct stop sequence simpler. Or is there a...
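Something like this minimal sketch is what I have in mind; the function and the loop are hypothetical, not the current callback API:

```rust
/// Hypothetical sketch: match stop sequences on raw token IDs instead of
/// the decoded string, so detection happens exactly on token boundaries.
fn hits_stop_sequence(generated: &[u32], stop: &[u32]) -> bool {
    generated.ends_with(stop)
}

fn main() {
    // Illustrative token IDs; a real stop sequence would come from the tokenizer.
    let stop = [42u32, 7];
    let mut generated: Vec<u32> = Vec::new();

    // Stand-in for the sampling loop: the callback would receive each new
    // token ID and abort generation once the stop sequence appears.
    for tok in [15u32, 3, 42, 7, 99] {
        generated.push(tok);
        if hits_stop_sequence(&generated, &stop) {
            break;
        }
    }
    assert_eq!(generated, vec![15, 3, 42, 7]);
}
```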
See https://github.com/rustformers/llm/pull/325. cuBLAS/CLBlast acceleration isn't currently supported on the main branch, meaning you can build with acceleration enabled but it won't accelerate inference. Also, only `llama`-based models...
Yeah, I was planning to create a table in the "accelerators" docs which shows which architecture can be accelerated by which GPU backend, as it's likely that some models will...
`rustformers` uses `llama.cpp` as its ggml source. Feel free to create a PR including this change; it seems like you only need to adjust the `build.rs` of the `ggml-sys` crate (see the sketch below). I...
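To sketch what I mean (this is illustrative, not the real `ggml-sys` build script; the source paths, the Cargo feature, and the preprocessor define are made-up placeholders):

```rust
// build.rs sketch for a `ggml-sys`-style crate, compiling the vendored
// ggml sources with the `cc` crate (requires `cc` in [build-dependencies]).
fn main() {
    let mut build = cc::Build::new();
    build
        .file("llama.cpp/ggml.c") // assumed path to the vendored source
        .include("llama.cpp");

    // Toggle the changed code path behind a (hypothetical) Cargo feature.
    if std::env::var("CARGO_FEATURE_MY_CHANGE").is_ok() {
        build.define("GGML_MY_CHANGE", None);
    }

    build.compile("ggml");
}
```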
Just adding that I saw the exact same behaviour with the CPU-only image. The problem even seems to get worse if I try to pass in a batch of...
I tested TEI against both an f32 and an f16 model; f16 models seem to be a bit slower than their f32 counterparts, but it's not significant. TEI seems to...