
[Kernel][Model] PagedAttention: Support custom attention bias for T5 model (1/2)

Open · NickLucche opened this pull request 11 months ago • 10 comments

This contribution is the first of two PRs attempting to add support for the T5 model. It is based on the great work done in https://github.com/vllm-project/vllm/issues/7366#user-content-support-custom-attention-bias and https://github.com/vllm-project/vllm/pull/3117.

It implements:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + A\right)V $$

In particular, this aims to be a more general addition (not tied to T5), enabling an arbitrary custom attention bias A in the PagedAttention kernel.

While I am aware this introduces a fully materialized matrix (akin to what's done in xformers), currently padded to max_seq_len, I believe the flexibility brought by allowing any custom bias would still pay for itself by providing initial compatibility for as-yet unsupported models (such as T5). I'd be happy to focus on performance optimization later (e.g. a variable-length bias matrix or FlashAttention-like IO optimization), or even add yet another model-specific feature to the kernel for computing T5's relative positional embeddings.
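
For reference, a minimal PyTorch sketch of the math above (not the PagedAttention kernel itself; function name and shapes are illustrative):

```python
from typing import Optional
import torch

def attention_with_bias(q: torch.Tensor,                          # [num_heads, q_len, head_dim]
                        k: torch.Tensor,                          # [num_heads, kv_len, head_dim]
                        v: torch.Tensor,                          # [num_heads, kv_len, head_dim]
                        attn_bias: Optional[torch.Tensor] = None, # [num_heads, q_len, kv_len]
                        ) -> torch.Tensor:
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hqd,hkd->hqk", q, k) * scale           # QK^T / sqrt(d_k)
    if attn_bias is not None:
        scores = scores + attn_bias                               # arbitrary bias A: ALiBi, T5 relative bias, masks, ...
    return torch.softmax(scores, dim=-1) @ v                      # [num_heads, q_len, head_dim]
```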

TODO work left for future contributions (i.e. extend to all platforms):

  • ROCm attn bias support
  • CPU attn bias support
  • ..?

NickLucche avatar Dec 19 '24 10:12 NickLucche

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

github-actions[bot] avatar Dec 19 '24 10:12 github-actions[bot]

While I am aware this introduces a fully materialized matrix (akin to what's done in xformers), currently padded to max_seq_len

@NickLucche In the case of T5, it can be limited to the maximum relative distance, which for T5 Large is 128. Anything beyond 128 just uses the last value from the bias tensor.

sfc-gh-phalama avatar Dec 19 '24 13:12 sfc-gh-phalama
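
A simplified sketch of that idea (clamping by absolute relative distance; T5's actual log-spaced bucketing is omitted), assuming a hypothetical per-head bias table of size max_distance:

```python
import torch

def clamped_relative_bias(bias_table: torch.Tensor,   # [num_heads, max_distance]
                          q_len: int,
                          kv_len: int) -> torch.Tensor:
    max_distance = bias_table.shape[-1]
    # Relative distance between each query and key position.
    rel = torch.arange(kv_len)[None, :] - torch.arange(q_len)[:, None]   # [q_len, kv_len]
    # Distances >= max_distance all reuse the last entry of the table.
    rel = rel.abs().clamp(max=max_distance - 1)
    return bias_table[:, rel]                                            # [num_heads, q_len, kv_len]
```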

Thanks for the quick input @sfc-gh-phalama. Personally, I'd rather not have the attn bias implementation depend on or be tied to T5/relative positional encoding mechanisms; otherwise we could probably just implement the formula inline.

NickLucche avatar Dec 20 '24 08:12 NickLucche

Fixed the failing kernel tests and rebased; not sure how related the rest of the errors are, tbh: https://buildkite.com/vllm/ci/builds/12022#01946b42-da66-4512-919d-c2115515eb6d/147-444

`AssertionError: please set tensor_parallel_size (4) to less than max local gpu count (1)` looks more like a CI/node issue.

NickLucche avatar Jan 15 '25 20:01 NickLucche

@NickLucche the gpu count issue might be fixed by this PR: https://github.com/vllm-project/vllm/pull/12111

joerunde avatar Jan 16 '25 18:01 joerunde

CI failures still look unrelated to me :/

NickLucche avatar Jan 21 '25 18:01 NickLucche

I believe this PR could also be useful in places like https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py#L863, where we could pass in the custom attention mask as the bias term A while getting rid of the kv-cache handling logic in the model.

NickLucche avatar Jan 22 '25 18:01 NickLucche
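
As an illustration of that use case, a hypothetical helper (not an existing vLLM function) that turns a boolean attention mask into an additive bias A, with 0 where attention is allowed and a large negative value elsewhere:

```python
import torch

def mask_to_attn_bias(mask: torch.Tensor,                    # [q_len, kv_len], True = may attend
                      dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # 0 where attention is allowed, dtype-min (effectively -inf) where masked out,
    # so it can be added directly to the attention logits as the bias term A.
    bias = torch.zeros(mask.shape, dtype=dtype)
    return bias.masked_fill_(~mask, torch.finfo(dtype).min)
```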

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Feb 05 '25 21:02 mergify[bot]

Rebased

NickLucche avatar Feb 17 '25 16:02 NickLucche

Hey thanks for the review!

Do the attn_bias arguments need to be added to the other attention backends?

The initial effort was planned only for the pagedattn backend, so that we can at least run with xformers+pagedattn. FlashAttn would require a separate PR on the base repo or on the vLLM-maintained fork, so that requires more involvement.

Is the new tensor materialised even if the attention bias is not used?

Nope, attn_bias is an optional pointer; when the pointer is null it's a no-op.

NickLucche avatar Feb 17 '25 17:02 NickLucche
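
To make the None-is-a-no-op contract concrete, a hedged Python-level sketch using PyTorch's scaled_dot_product_attention as a stand-in for the kernel (the real change lives in the CUDA PagedAttention kernel, where a null attn_bias pointer skips the bias add; decode_attention is an illustrative name, not the backend's actual signature):

```python
from typing import Optional
import torch
import torch.nn.functional as F

def decode_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     attn_bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    # attn_bias=None: no bias tensor is materialized or read; the path is
    # identical to the current bias-free behavior.
    # attn_bias=tensor: the additive bias is applied to the attention logits.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
```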

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Mar 13 '25 03:03 mergify[bot]

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions[bot] avatar Jun 12 '25 02:06 github-actions[bot]

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions[bot] avatar Sep 22 '25 02:09 github-actions[bot]

Closing, as vLLM no longer supports encoder-decoder models.

hmellor avatar Oct 02 '25 23:10 hmellor