Congcong Chen

Results 4 issues of Congcong Chen

This PR extends https://github.com/vllm-project/vllm/pull/4799 which introduces block-sparse flash attention and the Microsoft Phi-3-Small-8K and Phi-3-Small-128K models. It modifies the block-sparse attention prefill Triton kernel to add prefix-caching...

New model for https://huggingface.co/microsoft/Phi-4-multimodal-instruct/tree/main (co-authors: [Jacob Platin](https://github.com/jrplatin) and [Vadim Mazalov](https://github.com/vmazalov) for the speech encoder, and [Yen-Chun Chen](https://github.com/ChenRocks) for the vision encoder). FIX #13936

documentation
frontend
ci/build

### Your current environment

The output of `python collect_env.py`

```text
(myenv) aiscuser@node-0:~/vllm$ python collect_env.py
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1...
```

bug
stale

Now that [Phi-4-multimodal-instruct](https://github.com/vllm-project/vllm/pull/14119#top) is merged into main, we would like to open another PR to address the high latency overhead we have observed for Phi-4-multimodal when using LoRA. Benchmark results with...