vllm
[Attention] Update to latest FA3 code that supports different K and V head dims
Based on: https://github.com/vllm-project/vllm/pull/14253 (merge that first).
Update to the new FA3, which supports mixed head dims, so we don't have to pad V for prefill.
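For readers less familiar with why the padding mattered: in DeepSeek-style MLA prefill the query/key head dim is larger than the value head dim, and the previous FA3 build required the two to match. The sketch below illustrates the before/after calling pattern; the import path, the example head dims (192 vs. 128), and the keyword layout of `flash_attn_varlen_func` follow the upstream flash-attn varlen interface and are assumptions here, not the PR's actual code.

```python
# Sketch only: the K/V head-dim mismatch this PR removes the workaround for.
# Assumes a CUDA build exposing flash_attn_varlen_func with the upstream
# varlen keyword layout; the vendored path inside vLLM may differ.
import torch
from vllm.vllm_flash_attn import flash_attn_varlen_func  # assumed import path

num_tokens, num_heads = 1024, 16
head_dim_qk, head_dim_v = 192, 128  # e.g. MLA prefill: QK head dim > V head dim

q = torch.randn(num_tokens, num_heads, head_dim_qk, dtype=torch.bfloat16, device="cuda")
k = torch.randn(num_tokens, num_heads, head_dim_qk, dtype=torch.bfloat16, device="cuda")
v = torch.randn(num_tokens, num_heads, head_dim_v, dtype=torch.bfloat16, device="cuda")
cu_seqlens = torch.tensor([0, num_tokens], dtype=torch.int32, device="cuda")

# Old FA3: pad V up to the QK head dim, then slice the padding back off the output.
v_padded = torch.nn.functional.pad(v, (0, head_dim_qk - head_dim_v))
out_old = flash_attn_varlen_func(
    q, k, v_padded,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=num_tokens, max_seqlen_k=num_tokens,
    causal=True,
)[..., :head_dim_v]

# New FA3 (this PR): pass V with its native head dim, no padding or slicing.
out_new = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=num_tokens, max_seqlen_k=num_tokens,
    causal=True,
)
```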
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=0 lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024 --task gsm8k --num_fewshot=5 --limit 10
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.8|± |0.1333|
| | |strict-match | 5|exact_match|↑ | 0.8|± |0.1333|
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=1 lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024 --task gsm8k --num_fewshot=5 --limit 10
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.8|± |0.1333|
| | |strict-match | 5|exact_match|↑ | 0.8|± |0.1333|
VLLM_USE_V1=0 VLLM_USE_TRITON_FLASH_ATTN=0 lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024,enable_chunked_prefill=True --task gsm8k --num_fewshot=5 --limit 10
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.8|± |0.1333|
| | |strict-match | 5|exact_match|↑ | 0.8|± |0.1333|
VLLM_USE_V1=0 VLLM_USE_TRITON_FLASH_ATTN=1 lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024,enable_chunked_prefill=True --task gsm8k --num_fewshot=5 --limit 10
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.8|± |0.1333|
| | |strict-match | 5|exact_match|↑ | 0.8|± |0.1333|
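For anyone who prefers to reproduce these accuracy checks from Python instead of the CLI, a rough equivalent of the first command via lm-eval's `simple_evaluate` entry point is sketched below; the environment variables must be set before vLLM gets imported, and the exact API surface should be checked against your installed lm-eval version.

```python
# Rough Python equivalent of the first GSM8K command above (sketch, not CI code).
import os

# Select the V1 engine and the FA backend before lm-eval imports vLLM.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_USE_TRITON_FLASH_ATTN"] = "0"

import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,"
        "dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,"
        "max_model_len=16384,enforce_eager=True,max_num_batched_tokens=1024"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    limit=10,
)
print(results["results"]["gsm8k"])
```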
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
@khluu can we run the perf CI on this? Would be nice to check for regressions since there are a lot of FA changes.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LucasWilkinson.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Not seeing much speedup (minor TTFT improvement at very long contexts).
8xH200, DeepSeek-R1, VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_FLASHINFER_SAMPLER=1
Main:
| backend | input_tokens | output_tokens | output_toks/s | req/s | median_itl_ms | median_ttft_ms |
|---------|-------------:|--------------:|--------------:|------:|--------------:|---------------:|
| vllm | 1000 | 1000 | 1095.323697 | 1.095324 | 40.931626 | 149.658605 |
| vllm | 5000 | 1000 | 517.327850 | 0.517328 | 39.956240 | 5627.535715 |
| vllm | 10000 | 1000 | 315.639817 | 0.315640 | 39.697455 | 57821.907031 |
| vllm | 32000 | 1000 | 106.821047 | 0.106821 | 40.109005 | 193232.262791 |
This PR:
| backend | input_tokens | output_tokens | output_toks/s | req/s | median_itl_ms | median_ttft_ms |
|---------|-------------:|--------------:|--------------:|------:|--------------:|---------------:|
| vllm | 1000 | 1000 | 1034.089425 | 1.034089 | 39.884763 | 2365.846871 |
| vllm | 5000 | 1000 | 518.861897 | 0.518862 | 39.906694 | 5312.827895 |
| vllm | 10000 | 1000 | 317.878565 | 0.317879 | 40.045355 | 57160.958344 |
| vllm | 32000 | 1000 | 110.514999 | 0.110515 | 40.538517 | 184969.578100 |
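To put "minor" in numbers, here is the quick arithmetic on the 32000-input row of the two tables above (just a sanity calculation over the reported values, not an extra benchmark run):

```python
# Relative change on the 32000-input row, this PR vs. main.
main_ttft_ms, pr_ttft_ms = 193232.262791, 184969.578100
main_tps, pr_tps = 106.821047, 110.514999

print(f"median TTFT:       {(main_ttft_ms - pr_ttft_ms) / main_ttft_ms:.1%} lower")  # ~4.3%
print(f"output throughput: {(pr_tps - main_tps) / main_tps:.1%} higher")             # ~3.5%
```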
Still worth landing in preparation for https://github.com/vllm-project/vllm/pull/14258.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LucasWilkinson.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Hi, @LucasWilkinson @simon-mo. This PR uses a new function get_scheduler_metadata:
https://github.com/vllm-project/flash-attention/blob/main/vllm_flash_attn/flash_attn_interface.py#L78
It seems that a new release of vllm_flash_attn is needed.
https://pypi.org/project/vllm-flash-attn/
It’s been a long time since vllm-flash-attn had a new release. Should we consider publishing a new version?
We currently ship vllm_flash_attn inside the vLLM wheel, so you will likely need to rebuild from scratch: https://github.com/vllm-project/vllm/issues/16813#issuecomment-2814856926
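If you want to confirm which build you are actually running, a quick probe like the one below does the job; the vendored module path (`vllm.vllm_flash_attn`) is an assumption, so adjust it to wherever `flash_attn_interface` lives in your install.

```python
# Sketch: check whether the installed vLLM wheel bundles an FA build that
# exposes get_scheduler_metadata (added by the FA update in this PR).
try:
    from vllm.vllm_flash_attn import flash_attn_interface  # assumed vendored path
except ImportError:
    flash_attn_interface = None

has_it = flash_attn_interface is not None and hasattr(
    flash_attn_interface, "get_scheduler_metadata"
)
print(f"get_scheduler_metadata available: {has_it}")
```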
Hi, @LucasWilkinson. Does the latest FA3 still fail on Lovelace GPUs due to shared memory limits for some shapes?