
[V1][CUDA Graph] Fix attention metadata tensor sizes for padded batches

Open · ayushsatyam146 opened this pull request 3 months ago · 21 comments

Fixes #23789. Move CUDA graph padding before attention metadata construction to ensure tensor sizes match execution environment. Fixes crashes in FlashMLA and other attention backends caused by metadata/execution size mismatches.

  • Pass padded token count to _prepare_inputs()
  • Update slot_mapping and num_actual_tokens to use padded sizes
  • Eliminates the need for backend-specific workarounds (see the sketch below)

ayushsatyam146 avatar Aug 31 '25 07:08 ayushsatyam146
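A minimal, self-contained sketch of the ordering idea in the description above: pick the padded CUDA graph batch size first, then build the attention metadata with that padded count so slot_mapping and num_actual_tokens agree with what the replayed graph executes. The pad_for_cudagraph helper and the capture sizes below are illustrative assumptions, not the PR's actual code or vLLM's API.

import bisect

def pad_for_cudagraph(num_actual_tokens: int, capture_sizes: list[int]) -> int:
    # Return the smallest captured batch size >= num_actual_tokens, or the
    # actual size if it exceeds every captured size (eager fallback, no padding).
    idx = bisect.bisect_left(capture_sizes, num_actual_tokens)
    if idx == len(capture_sizes):
        return num_actual_tokens
    return capture_sizes[idx]

# Example: with graphs captured for token counts 1, 2, 4, 8, 16, ..., 128,
# a batch of 111 actual tokens is padded to 112 before the attention metadata
# is built, so the metadata matches the tensor sizes the replayed graph sees.
capture_sizes = sorted({1, 2, 4} | set(range(8, 129, 8)))
assert pad_for_cudagraph(111, capture_sizes) == 112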

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ayushsatyam146.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Sep 01 '25 04:09 mergify[bot]

Thanks for the review, @LucasWilkinson. I will work on the suggested changes. Do you have any idea why I'm getting CI failures on this consistently?

ayushsatyam146 avatar Sep 01 '25 04:09 ayushsatyam146

Do you have any idea why I'm getting CI failures on this consistently?

The CI can be flaky; I would just keep merging/rebasing off main fairly regularly. Tests are usually fixed on main in pretty short order.

LucasWilkinson avatar Sep 02 '25 03:09 LucasWilkinson

@LucasWilkinson I have accommodated your suggestions in the new changes. Can you please take a look? Thanks!

ayushsatyam146 avatar Sep 03 '25 14:09 ayushsatyam146

Hi @LucasWilkinson, I addressed your comments, PTAL thanks!

ayushsatyam146 avatar Sep 05 '25 18:09 ayushsatyam146

[image: CI failure screenshot] From the CI, it seems there is an inconsistency in the inputs of the varlen attention ops. It is on the eager part of a piecewise cudagraph, so you can easily check which input has the wrong shape for the padded size 112 (22937 is of [112, 8, 2, 128]) instead of the actual batch size 111. I think that in piecewise cudagraph, attn_metadata doesn't need to be padded, based on the previous behaviour.

Move CUDA graph padding before attention metadata construction to ensure tensor sizes match execution environment. Fixes crashes in FlashMLA and other attention backends caused by metadata/execution size mismatches.

I guess this is the problem only for full cudagraph. Am I right?

fhl2000 avatar Sep 06 '25 04:09 fhl2000
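A hedged sketch of the distinction raised here, assuming a simple cudagraph_mode flag with values "full" and "piecewise"; metadata_token_count and the flag are illustrative names, not vLLM's actual configuration. Full cudagraph replays attention inside the captured graph, so the metadata must use the padded size; piecewise cudagraph runs attention eagerly, so the metadata can keep the actual size.

def metadata_token_count(num_actual_tokens: int, num_padded_tokens: int,
                         cudagraph_mode: str) -> int:
    # Pick the token count used when building attention metadata.
    if cudagraph_mode == "full":
        # Attention is replayed inside the captured graph, so the metadata
        # tensor sizes must match the padded execution size.
        return num_padded_tokens
    # Piecewise: attention runs eagerly outside the captured graph, so the
    # metadata can keep the actual (unpadded) batch size.
    return num_actual_tokens

assert metadata_token_count(111, 112, "full") == 112
assert metadata_token_count(111, 112, "piecewise") == 111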

Found another potential issue: https://github.com/vllm-project/vllm/pull/24002/files#diff-80ee7e2a62f9dcfbb8a312dc4e3948557e97ef187290daebbcae1e28596bda29R746-R753:~:text=self.seq_lens,().item()

self.seq_lens.np[num_reqs:].fill(0)

should be

self.seq_lens.np[num_actual_reqs:].fill(0)

since the padded region of self.input_batch.num_computed_tokens_cpu[:num_reqs] may be left uncleared.

fhl2000 avatar Sep 06 '25 08:09 fhl2000
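A self-contained NumPy sketch of the off-by-padding issue noted above. num_actual_reqs and num_reqs mirror the variables in the linked diff; the buffer size and seq-len values are illustrative assumptions.

import numpy as np

num_actual_reqs = 5  # real requests in the batch
num_reqs = 6         # request count after CUDA graph padding (one dummy request)

# The persistent seq_lens buffer may still hold stale values from earlier steps.
seq_lens = np.full(8, 7, dtype=np.int32)

# Zeroing only past the padded count leaves a stale value in the padded slot:
buggy = seq_lens.copy()
buggy[num_reqs:].fill(0)
assert buggy[num_actual_reqs] == 7  # stale entry survives in the padded region

# Zeroing past the actual count clears the padded region as intended:
fixed = seq_lens.copy()
fixed[num_actual_reqs:].fill(0)
assert fixed[num_actual_reqs] == 0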

@ayushsatyam146 any progress on this?

LucasWilkinson avatar Sep 11 '25 14:09 LucasWilkinson

Hey @LucasWilkinson, I got sick earlier this week and was unable to work on this, but I will make the required changes and push them soon. Thanks for your patience!

ayushsatyam146 avatar Sep 11 '25 14:09 ayushsatyam146

@LucasWilkinson, I have made all the changes we agreed on in the threads above, but I am unable to get the CI pipelines to pass, and I don't think these are flaky failures. I would really appreciate some help, or even a few hints, on resolving them. Thanks!

ayushsatyam146 avatar Sep 15 '25 18:09 ayushsatyam146

⚠️ The sha of the head commit of this PR conflicts with #25282. Mergify cannot evaluate rules on this PR. ⚠️

mergify[bot] avatar Sep 22 '25 04:09 mergify[bot]

Documentation preview: https://vllm--24002.org.readthedocs.build/en/24002/

mergify[bot] avatar Nov 12 '25 05:11 mergify[bot]
