[Bug]: Capture CudaGraph with LoRA
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
When I use LoRA with enforce_eager=False (which means CUDA graphs should be captured), I find that the code below can cause a problem (in vllm/vllm/worker/model_runner.py):
```python
if self.lora_config:
    lora_mapping = LoRAMapping(
        **dict(index_mapping=[0] * batch_size,
               prompt_mapping=[0] * batch_size,
               is_prefill=False))
    self.set_active_loras(set(), lora_mapping)
```
Then I printed token_lora_indices via self.lora_manager._adapter_manager.punica_wrapper._token_lora_indices, but only got tensor([-1, -1, -1, ..., 0, 0, 0], device='cuda:0'). A token with a LoRA index of -1 does not seem right.
I don't think LoRA should be captured in the CUDA graph, especially in the case where you might want to switch between multiple different LoRAs.
What is the behaviour that you observed with enforce_eager=True?
-1 means no LoRA is applied.
I defined a model myself and called bgmv in it to do some LoRA calculations, so indices=-1 resulted in a CUDA error.
> I don't think LoRA should be captured in the CUDA graph, especially in the case where you might want to switch between multiple different LoRAs. What is the behaviour that you observed with enforce_eager=True?

If I run it with enforce_eager=True, the whole model will not be captured by the capture function, so indices=-1 will not appear.

> -1 means no LoRA is applied.

Seems I should modify my implementation for situations like indices=-1.
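
For reference, a minimal sketch of one way to make a custom LoRA computation tolerate indices of -1. This is plain PyTorch, not vLLM's punica/bgmv kernel, and the tensor shapes are assumptions:

```python
import torch

def lora_delta_per_token(
    x: torch.Tensor,                   # [num_tokens, hidden]
    lora_a: torch.Tensor,              # [num_loras, hidden, rank]
    lora_b: torch.Tensor,              # [num_loras, rank, out]
    token_lora_indices: torch.Tensor,  # [num_tokens], -1 = no LoRA
) -> torch.Tensor:
    """Hypothetical helper: tokens whose LoRA index is -1 get a zero delta."""
    # Clamp -1 to 0 so the gather is always valid, then mask the result.
    # This avoids data-dependent control flow, which is friendlier to
    # CUDA graph capture than branching on the indices.
    safe_idx = token_lora_indices.clamp(min=0)
    mask = (token_lora_indices >= 0).unsqueeze(-1).to(x.dtype)
    xa = torch.bmm(x.unsqueeze(1), lora_a[safe_idx]).squeeze(1)      # [num_tokens, rank]
    delta = torch.bmm(xa.unsqueeze(1), lora_b[safe_idx]).squeeze(1)  # [num_tokens, out]
    return delta * mask
```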
> the whole model will not be captured by the capture function

Well, yes, that's the behaviour of enforce_eager=True...
Correct me if I'm wrong, but you shouldn't capture the CUDA graph for LoRA?
Yes, indeed I don't want LoRA to be captured. I think my error was caused by my misuse of the bgmv kernel.
It is supported in https://github.com/vllm-project/vllm/pull/14626
Can you try again with 0.8.0?
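
For anyone landing here later, a minimal sketch of a retest on 0.8.0; the model name and adapter path are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enforce_eager defaults to False, so CUDA graphs are captured.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora"),
)
print(outputs[0].outputs[0].text)
```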
I'm currently using version 0.7.2. I think I'll try CUDA graphs for LoRA with version 0.8 in the future.
Also, I'd like to ask a question: enabling CUDA graphs for LoRA doesn't seem to require too many code changes? This has been very convenient for me while learning your code 😊
The 0.7.2 version should still be the V0 version of LoRA. For V0, vLLM only captures CUDA graphs during the decode stage, and LoRA does support CUDA graphs there, which you can confirm through torch.profiler.
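
A minimal sketch of that torch.profiler check (the model name is a placeholder); if decode steps are replayed from a captured graph, cudaGraphLaunch entries typically show up in the trace instead of individual kernel launches:

```python
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
) as prof:
    llm.generate(["Hello"], SamplingParams(max_tokens=16))

# Look for cudaGraphLaunch in the summary to confirm graph replay.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```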
> The 0.7.2 version should still be the V0 version of LoRA. For V0, vLLM only captures CUDA graphs during the decode stage, and LoRA does support CUDA graphs there, which you can confirm through torch.profiler.
Thanks for your reply! I'm curious about the difference between LoRA V0 and V1. I read the description in https://github.com/vllm-project/vllm/pull/13096, but I'm still confused by the sentence "V1 doesn't group requests based on LoRA ID. The new set of kernels have information about which input tokens map to which LoRA ID and they use this information to load the appropriate input tokens." This looks like the bgmv implementation?
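
One way to read that sentence, sketched in plain PyTorch rather than vLLM's actual Triton kernels: the earlier approach groups tokens by LoRA ID and runs one matmul per group, whereas kernels that carry a per-token LoRA index (as bgmv-style kernels do) gather the right adapter weights directly, as in the masked per-token sketch earlier in this thread. A grouped version might look like:

```python
import torch

def lora_delta_grouped(
    x: torch.Tensor,                   # [num_tokens, hidden]
    lora_a: torch.Tensor,              # [num_loras, hidden, rank]
    lora_b: torch.Tensor,              # [num_loras, rank, out]
    token_lora_indices: torch.Tensor,  # [num_tokens], -1 = no LoRA
) -> torch.Tensor:
    """Illustrative grouping by LoRA ID: one matmul per distinct adapter."""
    delta = x.new_zeros(x.shape[0], lora_b.shape[-1])
    for lora_id in token_lora_indices.unique().tolist():
        if lora_id < 0:
            continue  # tokens without a LoRA keep a zero delta
        rows = token_lora_indices == lora_id
        delta[rows] = x[rows] @ lora_a[lora_id] @ lora_b[lora_id]
    return delta
```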
From Slack, it seems to be supported for both V0 and V1 once you upgrade to the latest vLLM.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!