
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Results: 185 lorax issues (sorted by recently updated)

### Feature request Hello, our models are deployed with TGI (v1.4.3), and we also want to use LoRAX. But I find that the TGI version that LoRAX is based on is very different...

question

Currently, we treat each of the Q, K, V LoRAs as distinct tensors, meaning we do 3 SGMV calls per layer instead of 1. We should fuse them to improve... (see the fusion sketch below)

enhancement
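
A minimal sketch of the proposed fusion in plain PyTorch (the real change would live in the SGMV kernels; the shapes here are illustrative assumptions): stacking the Q/K/V LoRA A matrices along the rank dimension and arranging the B matrices block-diagonally lets a single pair of matmuls produce the concatenated QKV delta.

```python
import torch

h, r, d = 512, 8, 512  # illustrative hidden size, LoRA rank, projection dim

# Separate LoRA pairs for Q, K, V: today this costs 3 SGMV calls per layer.
a_q, b_q = torch.randn(h, r), torch.randn(r, d)
a_k, b_k = torch.randn(h, r), torch.randn(r, d)
a_v, b_v = torch.randn(h, r), torch.randn(r, d)

x = torch.randn(4, h)  # batch of hidden states

# Unfused: three independent down-/up-projections, concatenated.
unfused = torch.cat(
    [(x @ a) @ b for a, b in [(a_q, b_q), (a_k, b_k), (a_v, b_v)]], dim=-1
)

# Fused: stack A along the rank dim and make B block-diagonal, so one pair
# of matmuls (one SGMV call in practice) yields the same QKV delta.
a_qkv = torch.cat([a_q, a_k, a_v], dim=1)  # (h, 3r)
b_qkv = torch.block_diag(b_q, b_k, b_v)    # (3r, 3d)
fused = (x @ a_qkv) @ b_qkv

assert torch.allclose(unfused, fused, atol=1e-3)
```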

Add support for ReFT (Representation Finetuning): [Repo](https://github.com/stanfordnlp/pyreft), [Paper](https://arxiv.org/abs/2404.03592)

enhancement

### Feature request EETQ-quantized models perform with very good quality in my case, but loading is pretty slow. So if the base model is quantized with EETQ... (see the quantize-once sketch below)

enhancement
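
A hedged illustration of the requested workflow using the Hugging Face Transformers EETQ integration (the model ID and paths are placeholders, and this is the Transformers API rather than LoRAX's own loader): quantize once offline, save, and reload the already-quantized weights so serving startup skips the slow on-the-fly quantization pass.

```python
from transformers import AutoModelForCausalLM, EetqConfig

# One-time offline step: quantize the base model with EETQ and save it.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=EetqConfig("int8"),
    device_map="auto",
)
model.save_pretrained("/data/llama-2-7b-eetq")  # placeholder path

# Serving time: load the pre-quantized checkpoint directly, which avoids
# re-quantizing the weights on every startup.
model = AutoModelForCausalLM.from_pretrained(
    "/data/llama-2-7b-eetq", device_map="auto"
)
```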

Any failure in SGMV comes back as `Request failed during generation: Server error: No suitable kernel. dtype=Half` (a fallback sketch follows after this item). From Discord: > I have tried the fine-tuned adapter for llama2-7b. I trained...

enhancement
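
One plausible direction (an assumption about the fix, not the maintainers' stated plan) is to fall back to plain per-segment matmuls whenever no SGMV kernel specialization exists for the tensor dtype, instead of failing the whole request. A self-contained sketch with hypothetical names:

```python
import torch

def sgmv_or_fallback(x, lora_a, lora_b, seg_starts, seg_ends, sgmv_fn=None):
    """Apply per-adapter LoRA deltas over contiguous batch segments.

    sgmv_fn stands in for the fused kernel; when it is unavailable for this
    dtype, degrade to per-segment matmuls rather than raising
    "No suitable kernel".
    """
    if sgmv_fn is not None and x.dtype in (torch.float16, torch.bfloat16):
        return sgmv_fn(x, lora_a, lora_b, seg_starts, seg_ends)
    out = torch.zeros(x.shape[0], lora_b[0].shape[1], dtype=x.dtype, device=x.device)
    for a, b, start, end in zip(lora_a, lora_b, seg_starts, seg_ends):
        out[start:end] = (x[start:end] @ a) @ b  # (tokens, h) @ (h, r) @ (r, d)
    return out

# Example: two adapters over a batch of 6 tokens; float32 input takes the
# fallback path since no fused kernel is supplied.
x = torch.randn(6, 64)
lora_a = [torch.randn(64, 8), torch.randn(64, 8)]
lora_b = [torch.randn(8, 64), torch.randn(8, 64)]
print(sgmv_or_fallback(x, lora_a, lora_b, [0, 3], [3, 6]).shape)  # (6, 64)
```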

`max_batch_prefill_tokens` is now optional and defaults to the value of `max_input_length`. Users can now set a custom `max_input_length` without also having to specify `max_batch_prefill_tokens`.
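
The defaulting behavior reads roughly like this minimal sketch (function and argument names are illustrative, not LoRAX's internals):

```python
from typing import Optional

def resolve_max_batch_prefill_tokens(
    max_input_length: int,
    max_batch_prefill_tokens: Optional[int] = None,
) -> int:
    # When unset, cap prefill at the longest allowed input, so a custom
    # max_input_length never silently exceeds the prefill budget.
    if max_batch_prefill_tokens is not None:
        return max_batch_prefill_tokens
    return max_input_length

assert resolve_max_batch_prefill_tokens(4096) == 4096        # defaulted
assert resolve_max_batch_prefill_tokens(4096, 8192) == 8192  # explicit
```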

Due to the dynamic expert-activation logic of Mixtral (and any other MoE model), the CUDA graph compilation logic (which assumes deterministic execution) produces garbage outputs when enabled for this model... (see the guard sketch below)

bug
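
A sketch of the obvious mitigation, offered as an assumption rather than the actual LoRAX patch: skip CUDA graph capture for architectures whose expert routing varies per token, since graph replay re-executes the exact kernel sequence recorded at capture time.

```python
# Hypothetical guard; the architecture set and names are illustrative.
MOE_MODEL_TYPES = {"mixtral"}

def cuda_graphs_enabled(model_type: str, compile_requested: bool) -> bool:
    # CUDA graphs replay a fixed kernel sequence, so per-token expert
    # routing (dynamic dispatch) can silently produce garbage outputs.
    return compile_requested and model_type not in MOE_MODEL_TYPES

assert cuda_graphs_enabled("llama", True)
assert not cuda_graphs_enabled("mixtral", True)
```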