Congcong Chen
Hi @LucasWilkinson, with this PR merged I can no longer build vLLM from source; it fails with the following errors. Could you please help look into this? The...
@mgoin, PR https://github.com/vllm-project/vllm/pull/7730 is unfortunately not working for me.
Sorry, the temporary log was deleted automatically. It also looks like the error is different after I apply PR https://github.com/vllm-project/vllm/pull/7730. See ``` ets->outlines=0.0.43->vllm==0.5.4+cu118) (2024.1) Requirement already satisfied: six>=1.5 in /home/aiscuser/.local/lib/python3.10/site-packages (from...
I'm not sure why this PR triggers a build of MacheteKernel for me locally; it looks like a bug. Can we revert this PR, since it affects other users? cc @simon-mo
@LucasWilkinson, https://github.com/vllm-project/vllm/pull/7757 doesn't work either. With the patch the build now succeeds, but running the vLLM server fails; see the error below: ``` (myenv) aiscuser@node-0:~/vllm/benchmarks$ python...
Feel free to check out the PR description [here](https://github.com/vllm-project/vllm/pull/14119) for steps on: 1. Starting the server with the base model and vision/speech LoRA weights. 2. Sending requests to the OpenAI-compatible...
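For reference, here is a minimal sketch of those two steps. The flags, adapter names (`vision`, `speech`), and weight paths below are illustrative assumptions, not the exact commands from the PR description, which remains the authoritative guide.

```python
# Step 1 (run in a shell): start the OpenAI-compatible server with the base model
# and the vision/speech LoRA adapters. Adapter names and paths are placeholders.
#
#   vllm serve microsoft/Phi-4-multimodal-instruct \
#       --trust-remote-code \
#       --enable-lora \
#       --lora-modules vision=/path/to/vision_lora speech=/path/to/speech_lora

# Step 2: send a request to the server, selecting a LoRA adapter by the name
# registered with --lora-modules via the `model` field of the OpenAI API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="speech",  # LoRA adapter name, not the base model name
    messages=[{"role": "user", "content": "Describe what this adapter is for."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```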
We also found that the LoRA (Punica) kernels are quite slow for [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct). Here’s the fix: [PR #14272](https://github.com/vllm-project/vllm/pull/14272). With this fix, we observed up to a 5x improvement in generation speed.
The vLLM version that works: v0.5.2
Looking into this bug, I found that chunked prefill is not correctly supported by the block-sparse attention module used by the Phi-3-small-128k-instruct model, and chunked prefill is turned on by...
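Until the block-sparse attention path handles chunked prefill correctly, one possible workaround is to disable chunked prefill explicitly. A minimal sketch, assuming the offline `LLM` entry point and the `enable_chunked_prefill` engine argument; this is a workaround suggestion, not the fix discussed above.

```python
# Minimal sketch: run Phi-3-small-128k-instruct with chunked prefill disabled,
# avoiding the incompatibility with its block-sparse attention module.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-small-128k-instruct",
    trust_remote_code=True,        # Phi-3-small ships custom modeling code
    enable_chunked_prefill=False,  # work around the chunked-prefill issue
)

outputs = llm.generate(
    ["Summarize the benefits of block-sparse attention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```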
> Thanks for your contribution, have you tested the performance of autotune on models like llama?

Nope, I am not familiar with the llama family of models that use LoRA....