mistral.rs

Blazingly fast LLM inference.

Results: 186 mistral.rs issues

Refs #258.

new feature

**Describe the bug** Models seem to produce garbled output on very long prompts. If I use the following script: ```python import openai from transformers import AutoTokenizer if __name__ == "__main__":...

bug
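
The truncated script above drives a local OpenAI-compatible server with a very long prompt. A minimal sketch along those lines, where the server URL, model name, and prompt length are assumptions rather than values taken from the report, might look like:

```python
import openai
from transformers import AutoTokenizer

if __name__ == "__main__":
    # Assumed local mistral.rs server exposing an OpenAI-compatible endpoint.
    client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

    # Build a prompt long enough to stress long-context handling.
    prompt = "Summarize the following text. " + "lorem ipsum " * 4000
    print("prompt tokens:", len(tokenizer(prompt)["input_ids"]))

    resp = client.chat.completions.create(
        model="mistral",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)
```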

If this works, we can extend it to the other models. Hopefully, this will fix the problem in #339 for models without sliding window attention.

## Describe the bug If the number of device layers exceeds the model's layer count, then the number of host layers to assign seems to wrap/overflow instead of the expected `0`. **NOTE:** With `llama-cpp`...

bug
resolved
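
For context, the expected behaviour described here amounts to clamping the host-layer count at zero rather than letting the subtraction underflow. A minimal sketch of that arithmetic, not mistral.rs's actual code:

```python
def host_layers(total_layers: int, device_layers: int) -> int:
    # Requesting more device layers than the model has should leave
    # zero host layers, not a wrapped/underflowed count.
    return max(0, total_layers - device_layers)

assert host_layers(32, 40) == 0   # more device layers than the model has
assert host_layers(32, 20) == 12
```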

It would be nice to have a stable (or versioned) C API and provide a way to compile shared and static libraries so one can create bindings for various other languages....

new feature

**Describe the bug** Running a model from a GGUF file using [llama.cpp](https://github.com/ggerganov/llama.cpp) is very straightforward, like this: `server -v -ngl 99 -m Phi-3-mini-4k-instruct-Q6_K.gguf`, and if the model is supported, it just...

new feature

mistralrs_server should be mistralrs-server

documentation

**Describe the bug** It does not support some old hardware. Can it just convert bfloat16 to float16 before loading the model, just like vLLM does?

bug
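
The requested workaround is a dtype cast at load time. A hedged sketch of the idea using PyTorch (the file name is a placeholder, and this is not how mistral.rs actually loads weights):

```python
import torch

# Load weights on the CPU and downcast any bfloat16 tensors to float16
# so they can run on hardware without bf16 support.
state = torch.load("pytorch_model.bin", map_location="cpu")
state = {
    name: (t.to(torch.float16) if t.dtype == torch.bfloat16 else t)
    for name, t in state.items()
}
```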

## Describe the bug After building `mistral.rs` with the `cuda` feature, when I test it with `mistralrs-bench` and a local GGUF, I observed via `nvidia-smi` that layers were allocated to VRAM,...

bug

Bug: I am attempting to run mistral.rs inference on my own GGUF files, but before that I wanted to test with the example given in the documentation. I...

bug