
[V1][CUDA Graph] Fix attention metadata tensor sizes for padded batches

Open · ayushsatyam146 opened this pull request 3 months ago · 21 comments

Fixes #23789. Move CUDA graph padding before attention metadata construction to ensure tensor sizes match execution environment. Fixes crashes in FlashMLA and other attention backends caused by metadata/execution size mismatches.

  • Pass padded token count to _prepare_inputs()
  • Update slot_mapping and num_actual_tokens to use padded sizes
  • Eliminates the need for backend-specific workarounds (see the sketch below)

ayushsatyam146 avatar Aug 31 '25 07:08 ayushsatyam146
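A minimal, self-contained sketch of the ordering idea in the description above: pick the padded CUDA graph batch size first, then build the attention metadata with that padded count so slot_mapping and num_actual_tokens agree with what the replayed graph executes. The pad_for_cudagraph helper and the capture sizes below are illustrative assumptions, not the PR's actual code or vLLM's API.

import bisect

def pad_for_cudagraph(num_actual_tokens: int, capture_sizes: list[int]) -> int:
    # Return the smallest captured batch size >= num_actual_tokens, or the
    # actual size if it exceeds every captured size (eager fallback, no padding).
    idx = bisect.bisect_left(capture_sizes, num_actual_tokens)
    if idx == len(capture_sizes):
        return num_actual_tokens
    return capture_sizes[idx]

# Example: with graphs captured for token counts 1, 2, 4, 8, 16, ..., 128,
# a batch of 111 actual tokens is padded to 112 before the attention metadata
# is built, so the metadata matches the tensor sizes the replayed graph sees.
capture_sizes = sorted({1, 2, 4} | set(range(8, 129, 8)))
assert pad_for_cudagraph(111, capture_sizes) == 112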

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ayushsatyam146.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Sep 01 '25 04:09 mergify[bot]

Thanks for the review, @LucasWilkinson. I will work on the suggested changes. Do you have any idea why I'm getting CI failures on this consistently?

ayushsatyam146 avatar Sep 01 '25 04:09 ayushsatyam146

Do you have any idea why I'm getting CI failures on this consistently?

The CI can be flaky; I would just keep merging/rebasing off main fairly regularly. Tests are usually fixed on main in pretty short order.

LucasWilkinson avatar Sep 02 '25 03:09 LucasWilkinson

@LucasWilkinson I have accommodated your suggestions in the new changes. Can you please take a look? Thanks!

ayushsatyam146 avatar Sep 03 '25 14:09 ayushsatyam146

Hi @LucasWilkinson, I addressed your comments, PTAL thanks!

ayushsatyam146 avatar Sep 05 '25 18:09 ayushsatyam146

[image: CI failure screenshot] From the CI, it seems there is an inconsistency in the inputs of the varlen attention ops. It is on the eager part of a piecewise cudagraph, so you can easily check which input has the wrong shape for the padded size 112 (22937 is of [112, 8, 2, 128]) instead of the actual batch size 111. I think that in piecewise cudagraph, attn_metadata doesn't need to be padded, based on the previous behaviour.

Move CUDA graph padding before attention metadata construction to ensure tensor sizes match execution environment. Fixes crashes in FlashMLA and other attention backends caused by metadata/execution size mismatches.

I guess this is the problem only for full cudagraph. Am I right?

fhl2000 avatar Sep 06 '25 04:09 fhl2000
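A hedged sketch of the distinction raised here, assuming a simple cudagraph_mode flag with values "full" and "piecewise"; metadata_token_count and the flag are illustrative names, not vLLM's actual configuration. Full cudagraph replays attention inside the captured graph, so the metadata must use the padded size; piecewise cudagraph runs attention eagerly, so the metadata can keep the actual size.

def metadata_token_count(num_actual_tokens: int, num_padded_tokens: int,
                         cudagraph_mode: str) -> int:
    # Pick the token count used when building attention metadata.
    if cudagraph_mode == "full":
        # Attention is replayed inside the captured graph, so the metadata
        # tensor sizes must match the padded execution size.
        return num_padded_tokens
    # Piecewise: attention runs eagerly outside the captured graph, so the
    # metadata can keep the actual (unpadded) batch size.
    return num_actual_tokens

assert metadata_token_count(111, 112, "full") == 112
assert metadata_token_count(111, 112, "piecewise") == 111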

Found another potential issue: https://github.com/vllm-project/vllm/pull/24002/files#diff-80ee7e2a62f9dcfbb8a312dc4e3948557e97ef187290daebbcae1e28596bda29R746-R753:~:text=self.seq_lens,().item()

self.seq_lens.np[num_reqs:].fill(0)

should be

self.seq_lens.np[num_actual_reqs:].fill(0)

since the padded region of self.input_batch.num_computed_tokens_cpu[:num_reqs] may be left uncleared.

fhl2000 avatar Sep 06 '25 08:09 fhl2000
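A self-contained NumPy sketch of the off-by-padding issue noted above. num_actual_reqs and num_reqs mirror the variables in the linked diff; the buffer size and seq-len values are illustrative assumptions.

import numpy as np

num_actual_reqs = 5  # real requests in the batch
num_reqs = 6         # request count after CUDA graph padding (one dummy request)

# The persistent seq_lens buffer may still hold stale values from earlier steps.
seq_lens = np.full(8, 7, dtype=np.int32)

# Zeroing only past the padded count leaves a stale value in the padded slot:
buggy = seq_lens.copy()
buggy[num_reqs:].fill(0)
assert buggy[num_actual_reqs] == 7  # stale entry survives in the padded region

# Zeroing past the actual count clears the padded region as intended:
fixed = seq_lens.copy()
fixed[num_actual_reqs:].fill(0)
assert fixed[num_actual_reqs] == 0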

@ayushsatyam146 any progress on this?

LucasWilkinson avatar Sep 11 '25 14:09 LucasWilkinson

Hey @LucasWilkinson, I got sick earlier this week and was unable to work on this, but I will make the required changes and push them soon. Thanks for your patience!

ayushsatyam146 avatar Sep 11 '25 14:09 ayushsatyam146

@LucasWilkinson, I have made all the changes we agreed on in the threads above, but I am unable to get the CI pipelines to pass, and I don't think these are flaky failures. I would really appreciate some help, or even a few hints, on resolving them. Thanks!

ayushsatyam146 avatar Sep 15 '25 18:09 ayushsatyam146

⚠️ The sha of the head commit of this PR conflicts with #25282. Mergify cannot evaluate rules on this PR. ⚠️

mergify[bot] avatar Sep 22 '25 04:09 mergify[bot]

Documentation preview: https://vllm--24002.org.readthedocs.build/en/24002/

mergify[bot] avatar Nov 12 '25 05:11 mergify[bot]
