
[Bug]: Capture CudaGraph with LoRA

Open chenhongyu2048 opened this issue 8 months ago • 11 comments

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

When I use LoRA with enable_eager=False (which means CUDA graphs should be captured), I find that the code below can cause a problem (in vllm/vllm/worker/model_runner.py):

# Dummy LoRA mapping set as the active mapping during CUDA graph capture
if self.lora_config:
    lora_mapping = LoRAMapping(
        **dict(index_mapping=[0] * batch_size,
               prompt_mapping=[0] * batch_size,
               is_prefill=False))
    self.set_active_loras(set(), lora_mapping)

I then printed token_lora_indices via self.lora_manager._adapter_manager.punica_wrapper._token_lora_indices, but only got tensor([-1, -1, -1, ..., 0, 0, 0], device='cuda:0'). A token with a LoRA index of -1 does not seem right.
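For reference, a minimal debugging sketch of how I inspected the indices, assuming the private attribute chain reported above (it is internal and may differ between vLLM versions):

# Hypothetical debug print placed inside vllm/vllm/worker/model_runner.py,
# right after set_active_loras() in the capture path.
if self.lora_config:
    token_lora_indices = (
        self.lora_manager._adapter_manager.punica_wrapper._token_lora_indices)
    print("token_lora_indices during capture:", token_lora_indices)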

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

chenhongyu2048 avatar Mar 19 '25 05:03 chenhongyu2048

I don't think LoRA should be captured in the CUDA graph, especially in the case where you might want to switch between multiple different LoRAs.

What is the behaviour that you observed with enable_eager=True?

aarnphm avatar Mar 19 '25 05:03 aarnphm

-1 means no LoRA is applied.

jeejeelee avatar Mar 19 '25 05:03 jeejeelee

I defined a model myself and called bgmv in it to do some LoRA calculations, so indices=-1 resulted in a CUDA error.

I don't think LoRA should be captured in CUDA Graph, especially in the case you might want to switch multiple different loras.

What is the behaviour that you observed with enable_eager=True?

If I run it with enable_eager=True, the whole model will not be captured by the capture function, so indices=-1 will not appear.

-1 means no lora be applied.

It seems I should modify my implementation to handle situations like indices=-1.
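For what it's worth, a minimal sketch of one way to handle this in a custom PyTorch-level LoRA path (the names lora_a_stacked, lora_b_stacked, and token_lora_indices are placeholders for my own implementation, not vLLM's kernels): mask out tokens whose index is -1 before applying LoRA.

import torch

def apply_lora(x, base_out, lora_a_stacked, lora_b_stacked, token_lora_indices, scale=1.0):
    # x: [num_tokens, hidden], base_out: [num_tokens, out_dim]
    # lora_a_stacked: [num_loras, hidden, rank], lora_b_stacked: [num_loras, rank, out_dim]
    # token_lora_indices: [num_tokens]; -1 means "no LoRA" for that token
    has_lora = token_lora_indices >= 0
    if has_lora.any():
        idx = token_lora_indices[has_lora]
        a = lora_a_stacked[idx]              # [n, hidden, rank]
        b = lora_b_stacked[idx]              # [n, rank, out_dim]
        delta = torch.bmm(torch.bmm(x[has_lora].unsqueeze(1), a), b).squeeze(1)
        base_out[has_lora] += scale * delta
    return base_out

Note that the data-dependent branch (has_lora.any()) forces a host sync, so this particular sketch is only suitable for eager mode, not inside a captured graph.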

chenhongyu2048 avatar Mar 19 '25 06:03 chenhongyu2048

the total model will not be captured by the capture function

Well, yes, that's the behaviour of enforce_eager=True...

Correct me if I'm wrong, but you shouldn't capture the CUDA graph for LoRA?

aarnphm avatar Mar 19 '25 11:03 aarnphm

Yes, I don't want LoRA to be captured, indeed. I think my error was caused by my misuse of the bgmv kernel.

chenhongyu2048 avatar Mar 19 '25 13:03 chenhongyu2048

It is supported in https://github.com/vllm-project/vllm/pull/14626

Can you try again with 0.8.0?

aarnphm avatar Mar 19 '25 15:03 aarnphm

I'm currently using version 0.7.2. I think I'll try CUDA graph for LoRA with version 0.8 in the future.

Also, I'd like to ask a question: enabling CUDA graph for LoRA doesn't seem to require too many code changes? This has made it much easier for me to learn your code 😊

chenhongyu2048 avatar Mar 20 '25 06:03 chenhongyu2048

The 0.7.2 version should still be the V0 version of LoRA. For V0, vLLM only captures CUDA graphs during the decode stage, and LoRA does support CUDA graph there, which you can confirm through torch.profiler.
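A minimal sketch of how one might check this with torch.profiler (the model name and prompt are placeholders, and attaching an actual adapter via a lora_request is omitted here); captured-graph replays typically show up as cudaGraphLaunch entries in the trace:

import torch
from torch.profiler import profile, ProfilerActivity
from vllm import LLM, SamplingParams

# Placeholder model; enforce_eager=False so decode can use CUDA graphs.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, enforce_eager=False)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    llm.generate(["Hello"], SamplingParams(max_tokens=32))

# Look for cudaGraphLaunch in the trace to confirm graph replay during decode.
prof.export_chrome_trace("vllm_lora_trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))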

jeejeelee avatar Mar 20 '25 07:03 jeejeelee

The 0.7.2 version should still be the V0 version of LoRA. For V0, vLLM only captures CUDA graphs during the decode stage, and LoRA does support CUDA graph there, which you can confirm through torch.profiler.

Thanks for your reply! I'm curious about the difference between LoRA V0 and V1. I read the description in https://github.com/vllm-project/vllm/pull/13096, but I'm still confused by the sentence "V1 doesn't group requests based on LoRA ID. The new set of kernels have information about which input tokens map to which LoRA ID and they use this information to load the appropriate input tokens." Doesn't this sound like the bgmv implementation?
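As I understand it, the per-token mapping the PR describes is conceptually bgmv-like: each token carries its own LoRA ID and the kernel gathers the matching adapter weights. A rough reference sketch of that semantics in plain PyTorch (names are illustrative, not the actual V1 kernels):

import torch

def lora_shrink_expand_reference(x, lora_a, lora_b, token_lora_ids):
    # x: [num_tokens, hidden]; token_lora_ids: [num_tokens] with -1 == no LoRA
    # lora_a: [num_loras, hidden, rank]; lora_b: [num_loras, rank, out_dim]
    out = x.new_zeros(x.shape[0], lora_b.shape[-1])
    for i, lora_id in enumerate(token_lora_ids.tolist()):
        if lora_id < 0:
            continue  # token belongs to a request without an adapter
        # "shrink" into rank-r space, then "expand" back to the output dim
        out[i] = (x[i] @ lora_a[lora_id]) @ lora_b[lora_id]
    return out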

chenhongyu2048 avatar Mar 20 '25 17:03 chenhongyu2048

From Slack, it seems to be supported for both V0 and V1 once you upgrade to the latest vLLM.

aarnphm avatar Mar 20 '25 19:03 aarnphm

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jun 20 '25 02:06 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jul 20 '25 02:07 github-actions[bot]