Congcong Chen
This PR is a feature enrichment of https://github.com/vllm-project/vllm/pull/4799, which introduced blocksparse flash attention and the Microsoft Phi-3-Small-8K and Phi-3-Small-128K models. This PR modifies the block-sparse attention prefill Triton kernel to add prefix caching...
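For context, block-sparse attention of the kind used by the Phi-3-Small models restricts each query block to a causal subset of key blocks, typically a local window plus periodically repeated "vertical" columns. The sketch below builds such a block-level mask in NumPy; the parameter names `local_blocks` and `vert_stride` and their defaults are illustrative assumptions, not values taken from this PR or the model configs.

```python
import numpy as np

def blocksparse_mask(num_blocks, local_blocks=4, vert_stride=8):
    """Causal block-sparse attention pattern (illustrative sketch).

    Block (q, k) is attended when k is causal (k <= q) AND either:
      - within the last `local_blocks` blocks of q (local window), or
      - a periodic "vertical" column (every `vert_stride`-th block).
    Parameters are hypothetical, not taken from the PR.
    """
    q = np.arange(num_blocks)[:, None]   # query block indices (column vector)
    k = np.arange(num_blocks)[None, :]   # key block indices (row vector)
    causal = k <= q
    local = (q - k) < local_blocks
    vertical = (k + 1) % vert_stride == 0
    return causal & (local | vertical)

mask = blocksparse_mask(16)
```

A prefill kernel would iterate only over the `True` blocks of each row, which is what makes combining this sparsity pattern with prefix caching nontrivial: cached-prefix key blocks must still be visited wherever the pattern selects them.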
New model support for https://huggingface.co/microsoft/Phi-4-multimodal-instruct/tree/main. Co-authors: [Jacob Platin](https://github.com/jrplatin) and [Vadim Mazalov](https://github.com/vmazalov) for the speech encoder, and [Yen-Chun Chen](https://github.com/ChenRocks) for the vision encoder. FIX #13936
### Your current environment

The output of `python collect_env.py`:

```text
(myenv) aiscuser@node-0:~/vllm$ python collect_env.py
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
...
```
Now that [Phi-4-multimodal-instruct](https://github.com/vllm-project/vllm/pull/14119#top) has been merged into main, we would like a follow-up PR to address the high latency overhead we have observed for Phi-4-multimodal when using LoRA. Benchmark results with...