[Bug] Why can't I use multi-lora adapter and radix attention together?
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [X] 5. Please use English, otherwise it will be closed.
Describe the bug
Why can't multi-LoRA adapters and radix attention be used together? If multiple LoRA adapters are loaded, why not simply prepend the LoRA adapter ID before the first token, so that cached prefixes from different adapters stay separate (a sketch of this idea is below)?
When serving with multiple LoRA adapters, inference is extremely slow because radix attention cannot be used.
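To make the suggestion concrete, here is a minimal, hypothetical sketch (not sglang code; all class and function names are invented) of a prefix cache whose lookup key starts with the LoRA adapter ID, so prefix reuse only ever happens within the same adapter:

```python
from typing import Dict, List, Optional, Tuple


class PerAdapterPrefixCache:
    """Toy prefix cache whose lookup key begins with the LoRA adapter ID."""

    def __init__(self) -> None:
        # Maps (adapter_id, token-prefix tuple) -> an opaque cached-KV handle.
        self._cache: Dict[Tuple[str, Tuple[int, ...]], object] = {}

    def insert(self, adapter_id: str, tokens: List[int], kv_handle: object) -> None:
        # Register every prefix of the token sequence under this adapter's key.
        for end in range(1, len(tokens) + 1):
            self._cache[(adapter_id, tuple(tokens[:end]))] = kv_handle

    def longest_match(self, adapter_id: str, tokens: List[int]) -> Optional[object]:
        # A cached prefix is reused only if it was produced under the same adapter,
        # so prefix caching stays correct even with many LoRA adapters loaded.
        for end in range(len(tokens), 0, -1):
            hit = self._cache.get((adapter_id, tuple(tokens[:end])))
            if hit is not None:
                return hit
        return None


if __name__ == "__main__":
    cache = PerAdapterPrefixCache()
    cache.insert("lora_a", [1, 2, 3], kv_handle="kv_for_lora_a")
    # The same token prefix under a different adapter does not hit lora_a's cache.
    assert cache.longest_match("lora_b", [1, 2, 3]) is None
    # Within the same adapter, the longest cached prefix is still found.
    assert cache.longest_match("lora_a", [1, 2, 3, 4]) == "kv_for_lora_a"
```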
Reproduction
https://github.com/sgl-project/sglang/blob/v0.4.1.post5/python/sglang/srt/server_args.py#L876-L881
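For context, the linked server_args.py lines appear to be where enabling LoRA forces the radix cache off. The snippet below is only a paraphrased toy reconstruction of that guard (the class and field names are assumptions, not the actual sglang source), included to show why multi-LoRA serving currently loses prefix reuse:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeServerArgs:
    # Hypothetical stand-in for sglang's ServerArgs; field names are assumptions.
    lora_paths: Optional[List[str]] = None
    disable_radix_cache: bool = False

    def apply_lora_constraints(self) -> None:
        # Once LoRA adapters are configured, the radix (prefix) cache is forced
        # off, so every request loses prefix reuse and prefill becomes slower.
        if self.lora_paths is not None:
            self.disable_radix_cache = True


args = FakeServerArgs(lora_paths=["/path/to/adapter_a"])
args.apply_lora_constraints()
assert args.disable_radix_cache  # prefix caching is lost as soon as LoRA is enabled
```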
Environment
root@33e74a81f115:/sglang/python# python3 -m sglang.check_env
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA A100-SXM4-80GB
GPU 0 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu124
sglang: 0.4.0.post2
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.47.0
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.10
fastapi: 0.115.6
hf_transfer: 0.1.8
huggingface_hub: 0.26.3
interegular: 0.3.3
modelscope: 1.21.0
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.3
multipart: 0.0.19
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.57.0
anthropic: 0.40.0
decord: 0.6.0
Still under development @Fridge003
We will try to implement this feature 😃
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Can you explain why multi-LoRA and prefix matching cannot be used at the same time? Is it a limitation of the system design, or of the compute/kernel adaptation?
Actually, they are compatible. It's just that we haven't fully tested the radix cache functionality for LoRA. It will be supported soon.
supported in #7216