Alexander Matveev
test_gptq_marlin.py compares "gptq" outputs vs "gptq_marlin" outputs. However, their outputs can sometimes diverge slightly. This PR ensures gptq_marlin uses K/N breakdown configs more similar to the original...
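The comparison described above can be sketched as a tolerance check over paired per-token logprobs. This is a minimal illustration, not the actual test_gptq_marlin.py logic; the function name and tolerances are assumptions.

```python
import math

def outputs_close(logprobs_a, logprobs_b, rel_tol=0.05):
    """Hypothetical helper: return True if paired per-token logprobs from
    two backends (e.g. "gptq" vs "gptq_marlin") agree within a relative
    tolerance. Tolerance values here are illustrative."""
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-3)
        for a, b in zip(logprobs_a, logprobs_b)
    )

# Small divergence passes; a large one fails.
print(outputs_close([-1.00, -2.00], [-1.01, -2.02]))  # True
print(outputs_close([-1.00], [-2.00]))                # False
```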
This PR adds a new GPTQ Marlin 2:4 structured-sparse GPU kernel and support for running 2:4 sparse models in vllm. Currently supported configs are: 1. group_size of 128...
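For context, the 2:4 structured-sparsity constraint means that in every contiguous group of 4 weights, at most 2 may be nonzero. A minimal sketch of a validity check (the helper is illustrative, not part of the PR):

```python
def is_24_sparse(row):
    """Hypothetical check: does a flat weight row satisfy the 2:4 pattern,
    i.e. at most 2 nonzeros in each contiguous group of 4?"""
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        if sum(1 for w in group if w != 0) > 2:
            return False
    return True

print(is_24_sparse([0.5, 0.0, -1.2, 0.0, 0.0, 0.3, 0.0, 0.7]))  # True
print(is_24_sparse([1.0, 2.0, 3.0, 0.0]))                        # False
```

This pattern is what lets sparse tensor cores skip half the multiplies while storing only the nonzero values plus small per-group indices.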
This PR ports marlin unit tests and benchmarking code from magic_wand, so we have all of the marlin code in one place.
Attempts to fix the bug reported in https://github.com/vllm-project/vllm/issues/6258
SUMMARY: * Removed almost all the overhead from the `OpenAI` server, but still saw significant slowdown running in `AsyncLLMEngine` rather than `LLMEngine` on H100, including when we ran "headless" (e.g....
This PR adds output streaming support to multi-step + async. A first naive implementation of streaming with multi-step resulted in significant performance degradation (almost 2x slower TPOT), and after...
This PR fixes the issue described here: https://github.com/vllm-project/vllm/issues/8219#issuecomment-2345021537 The fix is simple: we need to skip log stats for sequences that are already preempted, since their state is reset to...
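The idea behind the fix can be sketched as a guard in the stats-collection loop: preempted sequences have had their timing state reset, so computing latency from them would use stale or cleared fields. All names below are illustrative, not vLLM's actual API.

```python
from enum import Enum

class SeqStatus(Enum):
    RUNNING = 1
    PREEMPTED = 2

class Seq:
    """Hypothetical sequence record; first_token_time is cleared (None)
    when the scheduler preempts the sequence and resets its state."""
    def __init__(self, status, first_token_time=None):
        self.status = status
        self.first_token_time = first_token_time

def collect_latency_stats(seqs, now):
    stats = []
    for seq in seqs:
        # Skip preempted sequences: their timing state was reset, so
        # logging them would operate on None / stale values.
        if seq.status == SeqStatus.PREEMPTED:
            continue
        stats.append(now - seq.first_token_time)
    return stats

seqs = [Seq(SeqStatus.RUNNING, 4.0),
        Seq(SeqStatus.PREEMPTED),        # would crash the naive loop
        Seq(SeqStatus.RUNNING, 7.0)]
print(collect_latency_stats(seqs, now=10.0))  # [6.0, 3.0]
```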