mistral.rs
Blazingly fast LLM inference.
This PR implements our first embedding model: nomic-ai/nomic-embed-text-v1!
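For context, an embedding model maps text to a fixed-length vector, and downstream tasks usually compare those vectors by cosine similarity. A minimal, dependency-free sketch of that comparison (the toy vectors below are made up; a real model such as nomic-embed-text-v1 would output much longer vectors):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; real text embeddings are
# hundreds of dimensions, but the comparison is the same.
query = [0.1, 0.9, 0.2, 0.0]
doc_a = [0.1, 0.8, 0.3, 0.1]   # close to the query
doc_b = [0.9, 0.0, 0.1, 0.7]   # far from the query

print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```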
This PR adds flake support for [Nix](https://nixos.org/).
- [ ] Support for LongRope (this is supported with ISQ in non-GGUF models, though) - The challenge is that the scaling information is not present in the GGUF file...
**Describe the bug** I am not sure if that's a bug. Python 3.10, M1.

```python
from mistralrs import Runner, Which, ChatCompletionRequest, Architecture

runner = Runner(
    Which.Plain(
        model_id="google/gemma-2-9b-it",
        repeat_last_n=64,
        tokenizer_json=None,
        arch=Architecture.Gemma,
    )
...
```
- [x] Loader and model - [ ] ISQ - [ ] AnyMoE - [ ] Device Mapping - [ ] X-LoRA/LoRA - [ ] Adapter activation
[Dolphin Vision 72B](https://huggingface.co/cognitivecomputations/dolphin-vision-72b) is a fine-tune of the base model [Qwen/Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B) that adds vision. This example uses transformers:

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from ...
```
**Describe the bug** High CPU use, no GPU use. macOS 14.4.1, MacBook Pro M1 Max, 64 GB.

```
cargo build --example phi3v --release --features metal
```

It takes minutes to execute...
This is a tracking issue for the development of AnyMoE, which will be broken up into several PRs.

- [x] Core functionality, plain models, all APIs: #476
- [x] Support...
This PR adds GPTQ quantization ([paper here](https://arxiv.org/abs/2210.17323)) support. Refs: #418, #448.
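As a rough illustration (not mistral.rs code), GPTQ-style quantization stores weights as low-bit integers with a shared scale and zero-point per group. The sketch below shows only that round-to-nearest storage scheme; the actual GPTQ algorithm additionally uses second-order (Hessian) information to compensate rounding error, which is omitted here:

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one weight group with a shared
    scale/zero-point. This is the storage format; GPTQ proper also
    corrects rounding error using second-order statistics."""
    qmax = (1 << bits) - 1              # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0     # guard against constant groups
    zero = lo
    q = [round((w - zero) / scale) for w in weights]  # ints in [0, qmax]
    dq = [v * scale + zero for v in q]                # dequantized floats
    return q, dq

q, dq = quantize_group([-0.5, -0.1, 0.0, 0.3, 0.5])
print(q)   # [0, 6, 8, 12, 15]
```

The worst-case per-weight error of this scheme is half the group scale, which is why smaller groups (at the cost of more scale metadata) give better accuracy.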
## Introduction

This implementation is based on my work for [candle](https://github.com/huggingface/candle). However, it incorporates some notable differences:

* I have completely removed support for the model format used in the...