lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
### Feature request Hello, our models are deployed with TGI (v1.4.3), and we also want to use lorax. But I find that the TGI version lorax is based on is very different...
Currently, we treat each of the Q, K, V LoRAs as distinct tensors, meaning we do 3 SGMV calls per layer instead of 1. We should fuse them to improve...
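A rough sketch of the idea in plain PyTorch (a stand-in for the actual SGMV kernel; shapes and names are illustrative assumptions): concatenating the Q, K, V LoRA A matrices along the rank dimension and block-diagonalizing the B matrices lets one fused matmul pair replace the three separate calls.

```python
import torch

# Illustrative sketch only: plain PyTorch stand-in for the SGMV call, with
# made-up shapes. Shows that the Q, K, V LoRA deltas can be produced by one
# fused matmul pair instead of three.
hidden, rank = 128, 8
x = torch.randn(4, hidden)  # hypothetical token hidden states

lora_a = {p: torch.randn(hidden, rank) for p in ("q", "k", "v")}
lora_b = {p: torch.randn(rank, hidden) for p in ("q", "k", "v")}

# Unfused: three separate matmul pairs per layer.
unfused = {p: (x @ lora_a[p]) @ lora_b[p] for p in ("q", "k", "v")}

# Fused: concatenate the A matrices along the rank dim and block-diagonalize
# the B matrices, so one matmul pair yields the concatenated [Q | K | V] delta.
a_fused = torch.cat([lora_a[p] for p in ("q", "k", "v")], dim=1)   # (hidden, 3*rank)
b_fused = torch.block_diag(*(lora_b[p] for p in ("q", "k", "v")))  # (3*rank, 3*hidden)
q, k, v = ((x @ a_fused) @ b_fused).chunk(3, dim=-1)

assert torch.allclose(q, unfused["q"], atol=1e-4)
assert torch.allclose(v, unfused["v"], atol=1e-4)
```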
[Repo](https://github.com/stanfordnlp/pyreft) [Paper](https://arxiv.org/abs/2404.03592)
### Feature request An EETQ-quantized model performs with very good quality in my case, but loading is pretty slow. So if the base model is quantized with EETQ...
Any failure in SGMV comes back as `Request failed during generation: Server error: No suitable kernel. dtype=Half`. From Discord: > I have tried the fine-tuned adapter for llama2-7b. I trained...
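One quick thing worth checking (an assumption on my part, not a confirmed root cause) is whether the GPU is new enough for the SGMV kernels at all; the kernels are only built for certain architectures, so on older cards the dispatch can fail even though the dtype looks fine.

```python
import torch

# Hedged diagnostic sketch: SGMV kernels only ship for newer GPU
# architectures (the exact cutoff is an assumption here; check the lorax
# docs), so a quick capability check helps separate "wrong dtype" from
# "unsupported GPU" when the "No suitable kernel" error appears.
major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")
if major < 8:
    print("Pre-Ampere GPU: SGMV kernels are likely unavailable; "
          "expect lorax to fall back or raise 'No suitable kernel'.")
```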
`max_batch_prefill_tokens` is now optional and defaults to the value of `max_input_length`. Users can now set a custom `max_input_length` without also having to specify `max_batch_prefill_tokens`.
Due to the dynamic expert-routing logic of Mixtral (and any other MoE model), the CUDA graph compilation logic (which assumes deterministic execution) produces garbage outputs when enabled for this model...
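A minimal, generic illustration of the underlying constraint (plain PyTorch, not the lorax implementation): a CUDA graph replays exactly the kernel sequence recorded at capture time, so any routing decision fixed during capture stays fixed on every replay.

```python
import torch

# Generic illustration (not lorax code): CUDA graphs replay the exact kernel
# sequence recorded at capture, so a host-side routing choice made during
# capture is frozen into the graph, no matter what later inputs would imply.
device = "cuda"
experts = [torch.nn.Linear(16, 16, device=device) for _ in range(2)]
static_x = torch.randn(1, 16, device=device)
chosen_expert = 0  # "routing" decision made once, before capture

# Warm up on a side stream, as recommended before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    _ = experts[chosen_expert](static_x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = experts[chosen_expert](static_x)  # expert 0 is baked in here

# Even if a router would now pick expert 1 for the new input, replay still
# runs expert 0: the graph cannot re-route, which is why data-dependent MoE
# execution and naive graph capture do not mix.
static_x.copy_(torch.randn(1, 16, device=device))
g.replay()
print(static_out)
```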