Congcong Chen
This PR is a feature enrichment of https://github.com/vllm-project/vllm/pull/4799, which introduced blocksparse flash attention and the Microsoft Phi-3-Small-8K and Phi-3-Small-128K models. This PR modifies the block-sparse attention prefill Triton kernel to add prefix caching...
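For context, block-sparse attention of the kind used by the Phi-3-Small models restricts each query block to a causal subset of key blocks, typically a local window plus periodically repeated "vertical" columns. The sketch below builds such a block-level mask in NumPy; the parameter names `local_blocks` and `vert_stride` and their defaults are illustrative assumptions, not values taken from this PR or the model configs.

```python
import numpy as np

def blocksparse_mask(num_blocks, local_blocks=4, vert_stride=8):
    """Causal block-sparse attention pattern (illustrative sketch).

    Block (q, k) is attended when k is causal (k <= q) AND either:
      - within the last `local_blocks` blocks of q (local window), or
      - a periodic "vertical" column (every `vert_stride`-th block).
    Parameters are hypothetical, not taken from the PR.
    """
    q = np.arange(num_blocks)[:, None]   # query block indices (column vector)
    k = np.arange(num_blocks)[None, :]   # key block indices (row vector)
    causal = k <= q
    local = (q - k) < local_blocks
    vertical = (k + 1) % vert_stride == 0
    return causal & (local | vertical)

mask = blocksparse_mask(16)
```

A prefill kernel would iterate only over the `True` blocks of each row, which is what makes combining this sparsity pattern with prefix caching nontrivial: cached-prefix key blocks must still be visited wherever the pattern selects them.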
New model support for https://huggingface.co/microsoft/Phi-4-multimodal-instruct/tree/main. Co-authors: [Jacob Platin](https://github.com/jrplatin) and [Vadim Mazalov](https://github.com/vmazalov) for the speech encoder, and [Yen-Chun Chen](https://github.com/ChenRocks) for the vision encoder. FIX #13936
### Your current environment

The output of `python collect_env.py`:

```text
(myenv) aiscuser@node-0:~/vllm$ python collect_env.py
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
...
```
Now that [Phi-4-multimodal-instruct](https://github.com/vllm-project/vllm/pull/14119#top) has been merged into main, we would like a follow-up PR to address the high latency overhead we have observed for Phi-4-multimodal when using LoRA. Benchmark results with...