vllm icon indicating copy to clipboard operation
vllm copied to clipboard

[Perf] Optimize MRotaryEmbedding::get_input_positions performance by numba

Open imkero opened this issue 7 months ago • 4 comments

Blocker

#16878

Qwen2/2.5-VL Benchmark

input length multi-modal data main branch this PR
374 tokens 1 (num images) x 324 (per image) image tokens 0.305ms 0.012ms
3308 tokens 10 (num images) x 324 (per image) image tokens 1.994ms 0.018ms
5379 tokens 1 (num images) x 5329 (per image) image tokens 0.944ms 0.016ms
3290 tokens 1 (num videos) x 10 (num temporal) x 324 (per frame) video tokens 0.666ms 0.016ms
32450 tokens 1 (num videos) x 100 (num temporal) x 324 (per frame) video tokens 4.050ms 0.063ms

Qwen2.5-Omni Benchmark

input length multi-modal data main branch this PR
374 tokens 1 (num images) x 324 (per image) image tokens 1.272ms 0.018ms
3308 tokens 10 (num images) x 324 (per image) image tokens 3.527ms 0.024ms
5379 tokens 1 (num images) x 5329 (per image) image tokens 1.367ms 0.025ms
3290 tokens 1 (num videos) x 10 (num temporal) x 324 (per frame) video tokens 1.329ms 0.022ms
32450 tokens 1 (num videos) x 100 (num temporal) x 324 (per frame) video tokens 1.797ms 0.072ms
3542 tokens 1 (num videos) x 10 (num temporal) x 324 (per frame) video tokens
+ 250 audio tokens
16.664ms 0.022ms

imkero avatar Apr 19 '25 21:04 imkero

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Apr 19 '25 21:04 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 24 '25 14:04 mergify[bot]

@imkero Thanks for the PR! This is amazing 🚀 Could you please resolve the merge conflict and the lint error?

WoosukKwon avatar Apr 26 '25 17:04 WoosukKwon

@imkero Thanks for the PR! This is amazing 🚀 Could you please resolve the merge conflict and the lint error?

Sure! I will update this PR soon.

TODO:

  • [x] fix merge conflicts & lint error
  • [x] complete the PR description
  • [x] enhance the error msg in assertions
  • [x] add unit tests for assertions
  • [x] refresh the benchmark result
  • [x] e2e correctness test

And I will share my experience of using numba later.

imkero avatar Apr 26 '25 18:04 imkero

@imkero Just so you know: To fix the CI failure, we should move numba from requirements/cuda.txt and requirements/rocm.txt to requirements/common.txt.

WoosukKwon avatar Apr 27 '25 19:04 WoosukKwon

@imkero Just so you know: To fix the CI failure, we should move numba from requirements/cuda.txt and requirements/rocm.txt to requirements/common.txt.

Thanks for your remind, I have moved it to requirements/common.txt now.

imkero avatar May 01 '25 15:05 imkero

@WoosukKwon I think this PR is ready for review now

imkero avatar May 01 '25 16:05 imkero

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar May 02 '25 08:05 mergify[bot]

I will take get_next_positions_tensor into consideration because they are reported to be time-costing as well in #17617

imkero avatar May 05 '25 06:05 imkero

I have written an optimized version of get_next_input_positions_tensor which is zero copy and no lru cache.

Will add e2e benchmarking result like #17617

imkero avatar May 05 '25 07:05 imkero

This PR continues the idea of #17617. Thanks @vadiklyutiy

Could you please take a look? @ywang96

imkero avatar May 05 '25 11:05 imkero

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar May 08 '25 05:05 mergify[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar May 13 '25 11:05 mergify[bot]