lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Results: 178 lorax issues, sorted by most recently updated

### Feature request/question Expose an ENV var/flag in `lorax-server` and `lorax-launcher` to set the base path of adapters during inference. As a workaround, we currently set HUGGINGFACE_HUB_CACHE=/home/adapters. With reference...

enhancement
good first issue

Currently we support multiple ranks per batch via a loop, but this reduces the batching effect and makes the process infeasible for CUDA graphs. Instead, we can pad out the buffers...

enhancement
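
The padding idea described in the issue above can be sketched roughly as follows. This is an illustrative sketch only, not LoRAX's actual implementation; the shapes and function names are assumptions.

```python
import torch

def pad_and_stack(adapters_a, adapters_b, hidden_size):
    """Pad per-adapter LoRA matrices to the batch's max rank and stack them.

    adapters_a[i]: (rank_i, hidden_size), adapters_b[i]: (hidden_size, rank_i).
    Zero-padded rows/columns contribute nothing to the product, so the result
    is numerically identical to the per-rank loop while giving fixed-shape
    buffers that CUDA graphs can capture.
    """
    max_rank = max(a.shape[0] for a in adapters_a)
    a_buf = torch.zeros(len(adapters_a), max_rank, hidden_size)
    b_buf = torch.zeros(len(adapters_b), hidden_size, max_rank)
    for i, (a, b) in enumerate(zip(adapters_a, adapters_b)):
        r = a.shape[0]
        a_buf[i, :r, :] = a
        b_buf[i, :, :r] = b
    return a_buf, b_buf

# The per-token LoRA delta then becomes a pair of matmuls over fixed shapes:
# delta[i] = (x[i] @ a_buf[i].T) @ b_buf[i].T
```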

**System Info:** Python - 3.11.5 CUDA - 12.2 GPU: A100, Driver Version: 535.104.05 #GPU - 2 **Command used** model=mistralai/Mistral-7B-Instruct-v0.1 volume=$PWD/data sudo podman run --gpus all --shm-size 1g -p 8080:80 -v...

question

Good thread on it here: https://www.reddit.com/r/LocalLLaMA/comments/1bgej75/control_vectors_added_to_llamacpp/ Given how parameter efficient control vectors are, they're a perfect candidate for something like LoRAX where you might want to serve many different such...

enhancement

Following error occurred at request time: ``` CUDA error: an illegal memory access was encountered ``` Repro context: - Mixtral-8x7b - Adapter (rank 8) - Long prompt - Sharded (2+...

bug

### Feature request When streaming a prompt response, the last message does not include the time to process the request. Would like to request that we include that information in...

enhancement
good first issue
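
For context, a streamed request with the Python client looks roughly like the sketch below, assuming the `lorax-client` package's streaming interface; the `request_duration` field mentioned in the comment is hypothetical and only illustrates the information this issue asks for.

```python
from lorax import Client

client = Client("http://127.0.0.1:8080")

text = ""
for response in client.generate_stream("What is deep learning?", max_new_tokens=64):
    if not response.token.special:
        text += response.token.text
    if response.details is not None:
        # The final streamed message carries generation details such as
        # finish_reason and generated_tokens; this issue asks for the total
        # request processing time to be reported here as well, e.g. via a
        # (hypothetical) field like response.details.request_duration.
        print(response.details.finish_reason, response.details.generated_tokens)
```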

Hi - How do I call OpenAI GPT (say gpt-4) and Google Gemini models through LoRAX? An example code snippet would really help. Thanks, Sekhar H.

question

Here is my code: ``` model=/data/vicuna-13b/vicuna-13b-v1.5/ docker run --gpus all --shm-size 1g -p 8080:80 -v /data/:/data \ ghcr.io/predibase/lorax:latest --model-id $model --sharded true --num-shard 2 \ --adapter-id baruga/alpaca-lora-13b ``` Here is...

question

### Feature request https://github.com/huggingface/text-generation-inference/issues/1633 ### Motivation Throughput and latency ### Your contribution @tgaddair what do you think?

enhancement

### Feature request DoRA introduces more overhead than pure LoRA, so it is recommended to merge the weights for inference (see https://github.com/huggingface/peft/blob/main/docs/source/developer_guides/lora.md#weight-decomposed-low-rank-adaptation-dora); it seems that this method will break current...

enhancement
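
For reference, the merge-for-inference path mentioned in the linked PEFT docs looks roughly like the sketch below; the model name and adapter path are placeholders. Note that merging folds the adapter into the base weights, producing a single fused model, which is at odds with serving many adapters dynamically on top of one shared base.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and a DoRA adapter (one trained with LoraConfig(use_dora=True)).
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model = PeftModel.from_pretrained(base, "path/to/dora-adapter")

# Folding the decomposed weights into the base removes DoRA's extra
# per-forward overhead, at the cost of a single merged checkpoint.
model = model.merge_and_unload()
```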