Alexander Matveev
test_gptq_marlin.py compares "gptq" outputs vs "gptq_marlin" outputs. However, their outputs can sometimes diverge slightly. This PR ensures gptq_marlin uses K/N breakdown configs more similar to the original...
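The comparison described above can be sketched as a tolerance check over paired per-token logprobs. This is a minimal illustration, not the actual test_gptq_marlin.py logic; the function name and tolerances are assumptions.

```python
import math

def outputs_close(logprobs_a, logprobs_b, rel_tol=0.05):
    """Hypothetical helper: return True if paired per-token logprobs from
    two backends (e.g. "gptq" vs "gptq_marlin") agree within a relative
    tolerance. Tolerance values here are illustrative."""
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-3)
        for a, b in zip(logprobs_a, logprobs_b)
    )

# Small divergence passes; a large one fails.
print(outputs_close([-1.00, -2.00], [-1.01, -2.02]))  # True
print(outputs_close([-1.00], [-2.00]))                # False
```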
This PR adds a new GPTQ Marlin 2:4 structured-sparse GPU kernel and support for running 2:4 sparse models in vllm. Currently supported configs are: 1. group_size of 128...
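For context, the 2:4 structured-sparsity constraint means that in every contiguous group of 4 weights, at most 2 may be nonzero. A minimal sketch of a validity check (the helper is illustrative, not part of the PR):

```python
def is_24_sparse(row):
    """Hypothetical check: does a flat weight row satisfy the 2:4 pattern,
    i.e. at most 2 nonzeros in each contiguous group of 4?"""
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        if sum(1 for w in group if w != 0) > 2:
            return False
    return True

print(is_24_sparse([0.5, 0.0, -1.2, 0.0, 0.0, 0.3, 0.0, 0.7]))  # True
print(is_24_sparse([1.0, 2.0, 3.0, 0.0]))                        # False
```

This pattern is what lets sparse tensor cores skip half the multiplies while storing only the nonzero values plus small per-group indices.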
This PR ports marlin unit tests and benchmarking code from magic_wand, so we have all of the marlin code in one place.
Attempts to fix the bug reported in https://github.com/vllm-project/vllm/issues/6258
SUMMARY: * Removed almost all the overhead from the `OpenAI` server, but still saw significant slowdown running in `AsyncLLMEngine` rather than `LLMEngine` on H100, including when we ran "headless" (e.g....
This PR adds output streaming support to multi-step + async. A first naive implementation of streaming with multi-step resulted in significant performance degradation (almost 2x slower TPOT), and after...
This PR fixes the issue described here: https://github.com/vllm-project/vllm/issues/8219#issuecomment-2345021537 The fix is simple: we need to skip log stats for sequences that are already preempted, since their state is reset to...
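The idea behind the fix can be sketched as a guard in the stats-collection loop: preempted sequences have had their timing state reset, so computing latency from them would use stale or cleared fields. All names below are illustrative, not vLLM's actual API.

```python
from enum import Enum

class SeqStatus(Enum):
    RUNNING = 1
    PREEMPTED = 2

class Seq:
    """Hypothetical sequence record; first_token_time is cleared (None)
    when the scheduler preempts the sequence and resets its state."""
    def __init__(self, status, first_token_time=None):
        self.status = status
        self.first_token_time = first_token_time

def collect_latency_stats(seqs, now):
    stats = []
    for seq in seqs:
        # Skip preempted sequences: their timing state was reset, so
        # logging them would operate on None / stale values.
        if seq.status == SeqStatus.PREEMPTED:
            continue
        stats.append(now - seq.first_token_time)
    return stats

seqs = [Seq(SeqStatus.RUNNING, 4.0),
        Seq(SeqStatus.PREEMPTED),        # would crash the naive loop
        Seq(SeqStatus.RUNNING, 7.0)]
print(collect_latency_stats(seqs, now=10.0))  # [6.0, 3.0]
```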