mistral.rs
Blazingly fast LLM inference.
**Describe the bug** Models seem to produce garbled output on very long prompts. If I use the following script: ```python import openai from transformers import AutoTokenizer if __name__ == "__main__":...```
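As an illustration (a minimal sketch, not the reporter's truncated script), a reproduction along these lines would build an overly long prompt, count its tokens with `AutoTokenizer`, and send it to the OpenAI-compatible server; the endpoint URL, model ids, and prompt length are assumptions.

```python
# Minimal sketch of a long-prompt reproduction; the endpoint, model ids,
# and prompt length are assumptions, not the reporter's original script.
import openai
from transformers import AutoTokenizer

if __name__ == "__main__":
    # mistral.rs exposes an OpenAI-compatible API; port/key are placeholders.
    client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

    # Build a very long prompt by repeating a filler sentence.
    prompt = "The quick brown fox jumps over the lazy dog. " * 2000

    # Count tokens locally to confirm the prompt really is very long.
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
    print("prompt tokens:", len(tokenizer(prompt).input_ids))

    # The garbled text reportedly shows up in the completion below.
    resp = client.chat.completions.create(
        model="mistral",
        messages=[{"role": "user", "content": prompt + "\nSummarize the text above."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)
```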
If this works, we can extend it to the other models. Hopefully, this will fix the problem in #339 for models without sliding window attention.
## Describe the bug If the number of device layers exceeds the model's, then the number of host layers to assign seems to wrap/overflow instead of the expected `0`. **NOTE:** With `llama-cpp`...
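The expected behaviour amounts to a clamped subtraction; the sketch below uses hypothetical names (the crate's real code is Rust) just to show the intended arithmetic.

```python
# Hypothetical illustration of the expected host-layer calculation;
# the function name and signature are made up for this example.
def host_layers(total_model_layers: int, requested_device_layers: int) -> int:
    # Clamp at zero: asking for more device layers than the model has
    # should put every layer on the device and none on the host.
    return max(total_model_layers - requested_device_layers, 0)

# More device layers requested than the model has:
print(host_layers(32, 40))  # expected 0, not a wrapped/overflowed count
```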
It would be nice to have a stable (or versioned) C API and a way to compile shared and static libraries so one can create bindings for various other languages....
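To make the request concrete, bindings over such a C API might look like the `ctypes` sketch below; every library and symbol name here is hypothetical, since no C API exists yet.

```python
# Entirely hypothetical sketch of Python bindings over a stable C API.
# Neither libmistralrs_c nor these symbols exist today.
import ctypes

lib = ctypes.CDLL("libmistralrs_c.so")  # hypothetical shared library

# Hypothetical signatures for a minimal load-and-generate API.
lib.mistralrs_load_gguf.argtypes = [ctypes.c_char_p]
lib.mistralrs_load_gguf.restype = ctypes.c_void_p
lib.mistralrs_generate.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
lib.mistralrs_generate.restype = ctypes.c_char_p

model = lib.mistralrs_load_gguf(b"Phi-3-mini-4k-instruct-Q6_K.gguf")
print(lib.mistralrs_generate(model, b"Hello!").decode())
```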
**Describe the bug** Running a model from a GGUF file using [llama.cpp](https://github.com/ggerganov/llama.cpp) is very straightforward, just like this: `server -v -ngl 99 -m Phi-3-mini-4k-instruct-Q6_K.gguf`, and if the model is supported, it just...
**Describe the bug** It does not support some older hardware. Can it just convert bfloat16 to float16 before loading the model, just like vLLM does?
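What is being asked for amounts to a dtype cast at load time; the torch snippet below is only a conceptual sketch of that conversion, not how mistral.rs or vLLM implement it.

```python
# Conceptual sketch: cast bfloat16 weights to float16 for GPUs without
# native bf16 support. Not mistral.rs internals.
import torch

bf16_weight = torch.randn(4, 4, dtype=torch.bfloat16)

# Straight cast; bf16 has a wider exponent range than fp16, so extreme
# values can overflow to inf or underflow to 0 after conversion.
fp16_weight = bf16_weight.to(torch.float16)

print(bf16_weight.dtype, "->", fp16_weight.dtype)
```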
## Describe the bug After building `mistral.rs` with the `cuda` feature and testing it with `mistralrs-bench` and a local GGUF, I observed via `nvidia-smi` that layers were allocated to vRAM,...
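One way to confirm programmatically that layers actually landed in vRAM, instead of watching `nvidia-smi`, is an NVML query like the sketch below; this is only a monitoring aid, unrelated to mistral.rs internals.

```python
# Sketch: query GPU memory usage via NVML (pynvml) while mistralrs-bench
# is running, as a stand-in for watching nvidia-smi.
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust the index as needed
info = nvmlDeviceGetMemoryInfo(handle)
print(f"vRAM used: {info.used / 2**20:.0f} MiB of {info.total / 2**20:.0f} MiB")
```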
Bug: I am attempting to run mistral.rs for inference on my own GGUF files, but before that I wanted to test with the example given in the documentation. I...