Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support
This PR implements dual-chunk flash attention, a training-free method to extend model context length (see also #6139), with support for sparse attention (https://github.com/microsoft/MInference).
This PR requires the sparse attention kernel from vllm-flash-attention. Qwen models with 1M context length support will be open-sourced in the next one or two weeks, and unit tests will be added later.
FIX #12452
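For reference, a minimal sketch of how the new backend might be exercised, assuming it is selected through the `VLLM_ATTENTION_BACKEND` environment variable; the backend name and the model name below are assumptions based on later comments in this thread, not part of this PR description.

```python
# Hedged sketch, not an official usage example from this PR.
import os

# Assumed backend name; adjust to whatever name this PR actually registers.
os.environ["VLLM_ATTENTION_BACKEND"] = "DUAL_CHUNK_FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-1M",  # 1M-context Qwen model mentioned later in the thread
    enforce_eager=True,                   # CUDA graph capture is not supported by this backend
)
print(llm.generate(["Once upon a time"], SamplingParams(max_tokens=64)))
```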
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
I see that you have enforce_eager=True set, so it looks like there are still compatibility issues with cudagraph.
Do you plan to fix this in the future?
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sighingnow.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
> I see that you have enforce_eager=True set, so it looks like there are still compatibility issues with cudagraph. Do you plan to fix this in the future?
All conflicts fixed, could you please take another look? thanks!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sighingnow.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
> I see that you have enforce_eager=True set, so it looks like there are still compatibility issues with cudagraph. Do you plan to fix this in the future?
>
> All conflicts fixed, could you please take another look? thanks!
I tested it because I thought it was fixed, but I still have the same problem as below. Are you saying that Cudagraph capture is possible? (enforce_eager=False)
Capturing CUDA graph shapes: 0%| | 0/35 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/lme-storage_810/jacob/needle/NeedleInAHaystack-lme/run_needle_in_haystack.py", line 435, in <module>
[rank0]: ht = LLMNeedleHaystackTester(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/lme-storage_810/jacob/needle/NeedleInAHaystack-lme/run_needle_in_haystack.py", line 94, in __init__
[rank0]: self.model_to_test = LLM(model=model_name)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/utils.py", line 1044, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/entrypoints/llm.py", line 228, in __init__
[rank0]: self.llm_engine = self.engine_class.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/engine/llm_engine.py", line 517, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/engine/llm_engine.py", line 276, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/engine/llm_engine.py", line 429, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/executor/gpu_executor.py", line 83, in initialize_cache
[rank0]: self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/worker/worker.py", line 274, in initialize_cache
[rank0]: self._warm_up_model()
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/worker/worker.py", line 292, in _warm_up_model
[rank0]: self.model_runner.capture_model(self.gpu_cache)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/worker/model_runner.py", line 1533, in capture_model
[rank0]: graph_runner.capture(**capture_inputs)
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/worker/model_runner.py", line 1885, in capture
[rank0]: self.model(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/model_executor/models/qwen2.py", line 496, in forward
[rank0]: hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/compilation/decorators.py", line 170, in __call__
[rank0]: return self.forward(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/model_executor/models/qwen2.py", line 359, in forward
[rank0]: hidden_states, residual = layer(
[rank0]: ^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/model_executor/models/qwen2.py", line 267, in forward
[rank0]: hidden_states = self.self_attn(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/model_executor/models/qwen2.py", line 189, in forward
[rank0]: attn_output = self.attn(q,
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/attention/layer.py", line 185, in forward
[rank0]: return torch.ops.vllm.unified_attention(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1116, in __call__
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/attention/layer.py", line 280, in unified_attention
[rank0]: return self.impl.forward(query, key, value, kv_cache, attn_metadata,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/bc-user/vllm_dual_chunk_250114/vllm/vllm/attention/backends/dual_chunk_flash_attn.py", line 373, in forward
[rank0]: assert decode_meta.scaling_factor is not None
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError
> I tested it because I thought it was fixed, but I still have the same problem as below. Are you saying that Cudagraph capture is possible? (enforce_eager=False)
Dual chunk attention doesn't support CUDA graphs, and I have added an assertion in arg_utils.py.
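A minimal sketch of the kind of guard being described; the function name and wording here are hypothetical, and the real assertion lives in this PR's arg_utils.py changes.

```python
# Illustrative only; not the actual code added in vllm/engine/arg_utils.py.
def _check_dual_chunk_cuda_graph(uses_dual_chunk_attention: bool,
                                 enforce_eager: bool) -> None:
    """Reject configurations that would capture CUDA graphs with dual chunk attention."""
    if uses_dual_chunk_attention and not enforce_eager:
        raise ValueError(
            "Dual chunk attention does not support CUDA graph capture; "
            "please set enforce_eager=True.")
```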
When I try the Needle-in-a-Haystack test with Qwen-7B and Llama-8B (code modified to support Llama), there is a bug that produces negative numbers once the context goes beyond roughly 13k-15k tokens.
It is indeed a bug introduced while preparing this PR; it has been fixed. Thanks!
Rebased against main.
Hi @youkaichao @simon-mo @WoosukKwon, do you think there are still things that need to be improved in this pull request?
Thanks!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sighingnow.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Hi @LucasWilkinson, most of the comments have been addressed; could you please take another look? Thanks!
The lint error comes from the prompt text; do you have any suggestions on how I could skip or resolve it?
@sighingnow Sorry for the delayed response! I've merged main into your branch so the pre-commit error should be cleared. I'll enable ready status for this PR so at least we can get the CI going before @tlrmchlsmth or @LucasWilkinson want to give their final greenlight!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sighingnow.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Any progress?
Any progress?
+1
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sighingnow.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
I think this is getting very close, thanks for rebasing it! My main concern right now is the large text files in the repo. Also, there appear to still be unaddressed review comments from before; please ping us when this is ready for final review.
Hi @LucasWilkinson, thanks for these comments. I have rebased this branch over current main, removed those example prompts and provided them as URLs, and addressed the reviewer comments above in this PR. I think it should now be ready for landing.
Before landing, a bugfix in flash-attention needs to be merged first: https://github.com/vllm-project/flash-attention/pull/60. After that, I will update the vllm-flash-attention dependency version in this PR.
A couple of questions:
- What will happen with this PR when running Qwen2 on systems where the dual-chunk attention backend is not supported? (e.g. AMD GPUs, TPUs, etc)
- Does vLLM automatically fall back to V0 when using dual-chunk attention?
We have started migrating the Qwen-related changes in our internal repo to V1, since V1 has become the default option in vLLM. The dual-chunk-attn backend will be adapted to V1 as well, and most of the changesets can be reused.
I have added an assertion in arg_utils.py to check that the current platform is CUDA (the sparse_attn_func is only available in vllm-project/flash-attention for CUDA) and that the current engine is V0.
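A sketch of what such a platform/engine guard could look like; only `current_platform.is_cuda()` is an existing vLLM helper, and the function and parameter names below are hypothetical rather than taken from this PR.

```python
# Illustrative only; the real check is part of this PR's arg_utils.py changes.
from vllm.platforms import current_platform


def _check_dual_chunk_platform(use_v1_engine: bool) -> None:
    """Reject platforms/engines where the dual chunk backend cannot run."""
    if not current_platform.is_cuda():
        raise ValueError(
            "Dual chunk attention requires CUDA: sparse_attn_func is only "
            "available in vllm-project/flash-attention for CUDA.")
    if use_v1_engine:
        raise ValueError(
            "Dual chunk attention is currently only supported on the V0 engine.")
```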
https://github.com/vllm-project/flash-attention/pull/60 has landed; can you please update this PR?
> vllm-project/flash-attention#60 has landed; can you please update this PR?
Done, and rebased to main.
@LucasWilkinson I have rebased onto current main again. Could you please take another look at this PR? Thanks!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sighingnow.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Hi @LucasWilkinson, thanks for the feedback. The first three comments have been addressed.
- [x] Address: https://github.com/vllm-project/vllm/pull/11844/files#r2070914950
- [x] Rebase
- [x] Address: https://github.com/vllm-project/vllm/pull/11844/files#r1943492013
- [ ] @mgoin address: https://github.com/vllm-project/vllm/pull/11844/files#r2039650211
@sighingnow Thanks for the update! Looking into the CI failure, it does not appear to be related (it is in V1 code, and this PR does not touch V1), but this is a bit out of my area of expertise; asking around (cc @russellb).
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @sighingnow.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
> @sighingnow Thanks for the update! Looking into the CI failure, it does not appear to be related (it is in V1 code, and this PR does not touch V1), but this is a bit out of my area of expertise; asking around (cc @russellb).
Rebased against main again. The failed test cases shouldn't be caused by this PR; they fail in a speculative decoding case, and it seems that case is not executed for all PRs.
So if I understand correctly, Qwen2.5-1M now actually uses the correct attention mechanism, so VRAM usage should be lower and prompt processing faster, right?
I tested Qwen/Qwen2.5-7B-Instruct-1M with the DualChunkFlashAttention backend.
It starts up fine, but it does not work correctly. @sighingnow
ubuntu-vllm-openai-1 | INFO 05-31 19:13:07 [logger.py:42] Received request cmpl-77d91882816c4f748e2023c93449f62d-0: prompt: 'Once upon a time', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=1, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [12522, 5193, 264, 882], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
ubuntu-vllm-openai-1 | INFO 05-31 19:13:07 [engine.py:316] Added request cmpl-77d91882816c4f748e2023c93449f62d-0.
ubuntu-vllm-openai-1 | INFO: 172.18.0.1:46884 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] AssertionError('seqused_k must be provided if block_table is provided')
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] Traceback (most recent call last):
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 162, in start
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] self.run_engine_loop()
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 225, in run_engine_loop
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] request_outputs = self.engine_step()
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 251, in engine_step
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] raise e
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 234, in engine_step
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return self.engine.step()
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1393, in step
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] outputs = self.model_executor.execute_model(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 299, in execute_model
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] driver_outputs = self._driver_execute_model(execute_model_req)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return self.driver_worker.execute_model(execute_model_req)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 420, in execute_model
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] output = self.model_runner.execute_model(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return func(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1843, in execute_model
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] hidden_or_intermediate_states = model_executable(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return self._call_impl(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return forward_call(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 481, in forward
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] hidden_states = self.model(input_ids, positions, intermediate_tensors,
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return self.forward(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 358, in forward
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] hidden_states, residual = layer(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return self._call_impl(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return forward_call(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 257, in forward
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] hidden_states = self.self_attn(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return self._call_impl(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return forward_call(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 187, in forward
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] attn_output = self.attn(q, k, v)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return self._call_impl(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return forward_call(*args, **kwargs)
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 237, in forward
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return torch.ops.vllm.unified_attention(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] return self._op(*args, **(kwargs or {}))
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 386, in unified_attention
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] output = self.impl.forward(self, query, key, value, kv_cache,
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 493, in forward
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] self._dual_chunk_flash_attn_prefill(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 673, in _dual_chunk_flash_attn_prefill
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] current_out = self._dual_chunk_flash_attn_prefill_func(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 1055, in _dual_chunk_flash_attn_prefill_func
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] flash_result = self._do_flash_attn(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 1207, in _do_flash_attn
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] output, softmax_lse = flash_attn_varlen_func(
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 204, in flash_attn_varlen_func
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] assert block_table is None or seqused_k is not None,
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ubuntu-vllm-openai-1 | ERROR 05-31 19:13:08 [engine.py:164] AssertionError: seqused_k must be provided if block_table is provided
Exact same issue as above
PR #19084 fixes this issue.
When working with contexts of 70k tokens, the loaded model plus the context uses something like 30 GB of VRAM, but during inference it goes up to 35-37 GB and then back down to 30 GB.
I'm guessing that's expected, but is there some way to preallocate this memory? Because if you let vLLM allocate 80% of the VRAM and it then tries to "eat" more VRAM, it will obviously OOM.
Edit:
- `FP8` model quantization is not working
- `pipeline_parallel_size` is not working
- `tensor_parallel_size` is not working
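A minimal sketch of one way to leave headroom for that transient spike, assuming it comes from prefill activations rather than the KV cache; the 0.7 utilization and the max_model_len value are illustrative assumptions, not recommendations from this PR.

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    enforce_eager=True,            # required by the dual chunk attention backend
    gpu_memory_utilization=0.7,    # target less VRAM for weights + KV cache, leaving headroom
    max_model_len=131072,          # cap context length to bound peak activation memory
)
```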