[Bug] Why can't I use multi-LoRA adapters and radix attention together?

upskyy opened this issue on Jan 14 '25

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • [X] 5. Please use English, otherwise it will be closed.

Describe the bug

Why can't multi-LoRA adapters and radix attention be used together? If I have multiple LoRA adapters, why not just prepend each adapter's ID before the first token, so that cached prefixes computed under different adapters can never match each other? (See the sketch below.)

When serving with multiple LoRA adapters, inference is extremely slow because radix attention cannot be used.
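
For illustration, here is a minimal sketch of the idea being proposed, with hypothetical names; this is not sglang's actual implementation, just the keying scheme the question describes:

```python
# Sketch: make the radix-cache key depend on the LoRA adapter by prepending
# a per-adapter sentinel ID, so KV prefixes computed under different
# adapters never match each other. (Hypothetical; not sglang's real code.)
from typing import List, Optional

# Assumed mapping from adapter name to a reserved sentinel token ID
# (negative so it can never collide with a real vocabulary ID).
LORA_SENTINEL_IDS = {"adapter_a": -1, "adapter_b": -2}

def radix_cache_key(token_ids: List[int], lora_name: Optional[str]) -> List[int]:
    """Token sequence used for radix-tree lookup/insertion."""
    if lora_name is None:
        return token_ids  # base model: key unchanged
    # Requests share cached prefixes only within the same adapter.
    return [LORA_SENTINEL_IDS[lora_name]] + token_ids

prompt = [101, 2023, 2003, 1037, 3231]
# Same prompt, different adapters -> disjoint cache keys, no KV aliasing.
assert radix_cache_key(prompt, "adapter_a") != radix_cache_key(prompt, "adapter_b")
```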

Reproduction

https://github.com/sgl-project/sglang/blob/v0.4.1.post5/python/sglang/srt/server_args.py#L876-L881
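
(For context: in the linked version, configuring LoRA adapters via --lora-paths force-disables the radix cache, which is why prefix caching cannot be used together with LoRA.)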

Environment

root@33e74a81f115:/sglang/python# python3 -m sglang.check_env                                                                                                                                                         

Python: 3.10.16 (main, Dec  4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA A100-SXM4-80GB
GPU 0 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu124
sglang: 0.4.0.post2
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.47.0
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.10
fastapi: 0.115.6
hf_transfer: 0.1.8
huggingface_hub: 0.26.3
interegular: 0.3.3
modelscope: 1.21.0
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.3
multipart: 0.0.19
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.57.0
anthropic: 0.40.0
decord: 0.6.0

upskyy · Jan 14 '25 07:01

Still under development @Fridge003

zhaochenyang20 · Jan 21 '25 22:01

We will try to implement this feature 😃

Sunt-ing · Feb 17 '25 19:02

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] · Apr 19 '25 00:04

Can you explain why multi-LoRA and prefix matching cannot be used at the same time? Is it a limitation of the system design, or of the underlying compute kernels?

GiggleWang · Jul 04 '25 08:07

> Can you explain why multi-LoRA and prefix matching cannot be used at the same time? Is it a limitation of the system design, or of the underlying compute kernels?

Actually, they are compatible; we just haven't fully tested the radix cache functionality for LoRA. It will be supported soon.

Fridge003 · Jul 04 '25 08:07

Supported in #7216.

Fridge003 · Aug 11 '25 18:08
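
For anyone arriving here after #7216: a hedged usage sketch, assuming a sglang version that includes that PR, so the radix cache is no longer force-disabled under LoRA. Adapter names, paths, and the model below are placeholders.

```python
# Launch the server with adapters and leave the radix cache on
# (i.e., do not pass --disable-radix-cache):
#
#   python3 -m sglang.launch_server \
#       --model-path meta-llama/Llama-3.1-8B-Instruct \
#       --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b

import requests

# The native /generate endpoint accepts a per-request "lora_path" selecting
# which adapter to apply; requests using the same adapter can then share
# cached prefixes in the radix tree.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Summarize radix attention in one sentence.",
        "sampling_params": {"max_new_tokens": 64},
        "lora_path": "adapter_a",
    },
    timeout=60,
)
print(resp.json())
```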