lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
# What does this PR do?

Fixes #433

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if...
# What does this PR do?

Loading the tokenizer with remote code from an unregistered class requires interactive user input to confirm yes/no, which breaks normal (non-interactive) processing. Fixes...
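For context, a minimal sketch (not the actual PR change; the model id below is a placeholder): passing `trust_remote_code=True` when loading the tokenizer tells transformers to skip the interactive confirmation entirely.

```python
# Sketch only: trust_remote_code=True avoids the interactive yes/no prompt
# that transformers issues when a tokenizer ships custom, unregistered code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "some-org/model-with-custom-tokenizer",  # hypothetical model id
    trust_remote_code=True,  # do not block on interactive confirmation
)
```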
# (WIP) Fix for the LM_HEAD issue

**Root cause**: the error is caused by incorrect segments passed to the `lora_b_sgmv` kernel during the prefill stage. This happens because we do...
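For illustration only (this is not the LoRAX implementation, and the helper name is made up): SGMV-style kernels expect contiguous token segments that share an adapter, so during prefill the segment boundaries have to be built from cumulative prompt lengths rather than one entry per request.

```python
# Illustrative sketch of prefill segment construction for an SGMV kernel.
import torch

def build_segments(prompt_lens, adapter_ids):
    """Return (segment_starts, segment_ends, segment_adapter_ids) for a batch."""
    starts, ends, adapters = [], [], []
    offset = 0
    for length, adapter in zip(prompt_lens, adapter_ids):
        starts.append(offset)
        ends.append(offset + length)
        adapters.append(adapter)
        offset += length
    return torch.tensor(starts), torch.tensor(ends), torch.tensor(adapters)

# Example: two requests with prompts of 5 and 3 tokens using adapters 0 and 1.
print(build_segments([5, 3], [0, 1]))  # (tensor([0, 5]), tensor([5, 8]), tensor([0, 1]))
```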
If speculative decoding is in use and the user wants to generate up to the model's maximum positional embeddings, runtime errors can arise, causing a CUDA device-side...
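A hedged sketch of the kind of clamping that avoids overrunning the positional budget; the function and parameter names here are assumptions, not LoRAX's actual API. With speculative decoding, each step may append extra draft tokens, so the effective budget must leave headroom below the model's max positional embeddings.

```python
# Illustrative sketch: leave room for speculative draft tokens appended per step.
def clamp_max_new_tokens(prompt_len, requested_new_tokens,
                         max_position_embeddings, num_speculative):
    budget = max_position_embeddings - prompt_len - num_speculative
    return max(0, min(requested_new_tokens, budget))

# Example: a 4000-token prompt against a 4096-position model with 4 draft tokens.
print(clamp_max_new_tokens(4000, 200, 4096, 4))  # 92
```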
### System Info

lorax_version: "a7e8175"
Python 3.10.8
Platform: ml.g5.16xlarge (AWS)

When deploying the Docker container with source "s3" and model_id "mistralai/Mistral-7B-Instruct-v0.2" (`lorax-launcher --port 8080 --source "s3"`), it failed...
### Feature request

Implement `v1/models`, like the OpenAI API, to list the available local **loras**. This is dependent on #199.

There is also a hurdle to this: a user may have multiple...
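If the endpoint were implemented, clients could reuse the standard OpenAI SDK to enumerate adapters. A sketch, assuming the server listens on port 8080 and the endpoint mirrors the OpenAI response shape:

```python
# Sketch of proposed client usage; the v1/models endpoint does not exist yet,
# and the base URL/port are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

# With the feature implemented, this would return the base model plus the
# locally available LoRA adapters.
for model in client.models.list():
    print(model.id)
```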
During fine-tuning, special tokens may be added that are specific to the adapter. During decoding, we should use those special tokens and ensure the correct stop tokens,...
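A small sketch of how adapter-specific tokens could be discovered by diffing the adapter's tokenizer against the base tokenizer; the adapter id below is hypothetical, and this is an illustration rather than the planned implementation.

```python
# Illustrative sketch: an adapter repo may ship its own tokenizer with extra
# special tokens and a different EOS token.
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
adapter_tok = AutoTokenizer.from_pretrained("some-org/finetuned-adapter")  # hypothetical

# Tokens the adapter added on top of the base vocabulary; decoding should use
# the adapter's EOS/stop tokens rather than the base model's.
added = set(adapter_tok.get_vocab()) - set(base_tok.get_vocab())
print("adapter-specific tokens:", added)
print("adapter eos:", adapter_tok.eos_token)
```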
WIP project roadmap for LoRAX. We'll continue to update this over time.

# v0.10

- [ ] Speculative decoding adapters
- [ ] AQLM

# v0.11

- [ ] Prefix...
It seems that if the server is flooded with requests for a new adapter it needs to download, a race condition can arise leading to CUDA errors. Needs more investigation.
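One common way to rule out such a race (a sketch only, not necessarily how LoRAX will fix it) is to serialize downloads of the same adapter behind a per-adapter lock, so a flood of concurrent requests triggers at most one download; `download_adapter` below is a hypothetical helper.

```python
# Illustrative sketch: at most one concurrent download per adapter id.
import asyncio
from collections import defaultdict

_download_locks = defaultdict(asyncio.Lock)
_downloaded = set()

async def ensure_adapter_downloaded(adapter_id: str):
    async with _download_locks[adapter_id]:
        if adapter_id in _downloaded:
            return
        await download_adapter(adapter_id)  # hypothetical download helper
        _downloaded.add(adapter_id)
```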