
[Attention] Update to latest FA3 code that supports different K and V head dims

Open LucasWilkinson opened this issue 9 months ago • 6 comments

Based on https://github.com/vllm-project/vllm/pull/14253; merge that first.

Update to the new FA3, which supports mixed head dims, so we don't have to pad for prefill.
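For context, a rough sketch of the padding workaround this removes (illustrative only: `attn_fn` stands in for the actual FA call, and the 192/128 head dims assume MLA-style prefill shapes):

```python
# Illustrative sketch of the workaround this PR removes; not vLLM's exact code.
# A kernel that requires head_dim(K) == head_dim(V) forces MLA-style prefill
# (Q/K head dim 192, V head dim 128) to pad V and slice the output.
import torch.nn.functional as F

def prefill_attn_with_padding(q, k, v, attn_fn):
    # q, k: [num_tokens, num_heads, 192]; v: [num_tokens, num_heads, 128]
    pad = k.shape[-1] - v.shape[-1]
    v_padded = F.pad(v, (0, pad))       # zero-pad V up to K's head dim
    out = attn_fn(q, k, v_padded)       # kernel now sees uniform head dims
    return out[..., : v.shape[-1]]      # slice the padding back off

# With the updated FA3, mixed head dims work directly:
#   out = attn_fn(q, k, v)             # no pad/slice, less memory traffic
```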

```
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=0 lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024 --task gsm8k --num_fewshot=5 --limit 10
```

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |  0.8|±  |0.1333|
|     |       |strict-match    |     5|exact_match|↑  |  0.8|±  |0.1333|

```
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=1 lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024 --task gsm8k --num_fewshot=5 --limit 10
```

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |  0.8|±  |0.1333|
|     |       |strict-match    |     5|exact_match|↑  |  0.8|±  |0.1333|

```
VLLM_USE_V1=0 VLLM_USE_TRITON_FLASH_ATTN=0 lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024,enable_chunked_prefill=True --task gsm8k --num_fewshot=5 --limit 10
```

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |  0.8|±  |0.1333|
|     |       |strict-match    |     5|exact_match|↑  |  0.8|±  |0.1333|

```
VLLM_USE_V1=0 VLLM_USE_TRITON_FLASH_ATTN=1 lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024,enable_chunked_prefill=True --task gsm8k --num_fewshot=5 --limit 10
```

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |  0.8|±  |0.1333|
|     |       |strict-match    |     5|exact_match|↑  |  0.8|±  |0.1333|

LucasWilkinson avatar Feb 11 '25 19:02 LucasWilkinson

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Feb 11 '25 19:02 github-actions[bot]

@khluu can we run the perf CI on this? Would be nice to check for regressions since there are a lot of FA changes.

LucasWilkinson avatar Feb 11 '25 20:02 LucasWilkinson

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Feb 27 '25 02:02 mergify[bot]


Not seeing much speedup (minor TTFT improvement at very long contexts).

8xH200, DeepSeek-R1, VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_FLASHINFER_SAMPLER=1

Main:

```
  backend  input_tokens  output_tokens  output_toks/s     req/s  median_itl_ms  median_ttft_ms
2    vllm          1000           1000    1095.323697  1.095324      40.931626      149.658605
1    vllm          5000           1000     517.327850  0.517328      39.956240     5627.535715
3    vllm         10000           1000     315.639817  0.315640      39.697455    57821.907031
0    vllm         32000           1000     106.821047  0.106821      40.109005   193232.262791
```


This PR:

```
  backend  input_tokens  output_tokens  output_toks/s     req/s  median_itl_ms  median_ttft_ms
2    vllm          1000           1000    1034.089425  1.034089      39.884763     2365.846871
1    vllm          5000           1000     518.861897  0.518862      39.906694     5312.827895
3    vllm         10000           1000     317.878565  0.317879      40.045355    57160.958344
0    vllm         32000           1000     110.514999  0.110515      40.538517   184969.578100
```
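
For reference, a reproduction sketch along these lines (the exact harness isn't shown here; the server command and sweep point are assumed, reusing the env vars above):

```bash
# Hypothetical reproduction sketch; the exact benchmark harness isn't shown above.
VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_FLASHINFER_SAMPLER=1 \
  vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8

# One point of the input-length sweep (1000/5000/10000/32000 in the tables):
python benchmarks/benchmark_serving.py --backend vllm \
  --model deepseek-ai/DeepSeek-R1 \
  --dataset-name random --random-input-len 1000 --random-output-len 1000
```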

Still worth landing in preparation for https://github.com/vllm-project/vllm/pull/14258.

LucasWilkinson avatar Mar 07 '25 04:03 LucasWilkinson


Hi, @LucasWilkinson @simon-mo. This PR uses a new function get_scheduler_metadata:

https://github.com/vllm-project/flash-attention/blob/main/vllm_flash_attn/flash_attn_interface.py#L78

It seems that a new release of vllm_flash_attn is needed.
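
A quick way to check whether an installed wheel already has it (just an illustrative probe, not vLLM's actual check):

```python
# Illustrative probe: older vllm_flash_attn wheels predate
# get_scheduler_metadata, so test for it before relying on it.
try:
    from vllm_flash_attn.flash_attn_interface import get_scheduler_metadata  # noqa: F401
except ImportError:
    get_scheduler_metadata = None  # wheel too old; a newer build is required
```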

chaunceyjiang avatar Apr 18 '25 07:04 chaunceyjiang

https://pypi.org/project/vllm-flash-attn/

It’s been a long time since vllm-flash-attn had a new release. Should we consider publishing a new version?

chaunceyjiang avatar Apr 18 '25 07:04 chaunceyjiang

> Hi, @LucasWilkinson @simon-mo. This PR uses a new function get_scheduler_metadata:
>
> https://github.com/vllm-project/flash-attention/blob/main/vllm_flash_attn/flash_attn_interface.py#L78
>
> It seems that a new release of vllm_flash_attn is needed.

We currently ship vllm_flash_attn inside the vLLM wheel, so you will likely need to rebuild from scratch: https://github.com/vllm-project/vllm/issues/16813#issuecomment-2814856926
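
Roughly, something like this (a sketch; it assumes a working CUDA build environment, see the linked comment for specifics):

```bash
# Rough from-source rebuild; assumes a working CUDA toolchain.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # compiles the bundled vllm_flash_attn as part of the build
```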

LucasWilkinson avatar Apr 18 '25 13:04 LucasWilkinson

Hi, @LucasWilkinson. Does the latest FA3 still fail on Lovelace GPUs due to shared memory limits for some shapes?

nnding avatar Apr 21 '25 09:04 nnding