Woosuk Kwon
This PR introduces `torch.compile` for the following basic custom ops: activations and RMSNorm. The main goals are: 1. Reduce the number of custom kernels maintained by vLLM. (I intentionally kept...
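For illustration, here is a minimal sketch (not this PR's exact code) of what replacing a custom RMSNorm CUDA kernel with a `torch.compile`-fused native PyTorch implementation can look like; all names are illustrative:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """RMS normalization in plain PyTorch, so torch.compile can fuse it
    instead of vLLM maintaining a hand-written CUDA kernel."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in float32 for numerical stability, then cast back.
        orig_dtype = x.dtype
        x = x.float()
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (x * self.weight.float()).to(orig_dtype)


# torch.compile generates a fused kernel for the elementwise ops.
norm = RMSNorm(hidden_size=4096).cuda()
compiled_norm = torch.compile(norm)
out = compiled_norm(torch.randn(8, 4096, device="cuda", dtype=torch.float16))
```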
Should be merged after #9437 and after the 10/17 PyTorch XLA nightly is available. This PR upgrades PyTorch XLA and uses `peak_bytes_used` to correctly profile the...
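As a rough sketch of the profiling idea, assuming the pinned torch_xla nightly reports `peak_bytes_used` via `xm.get_memory_info` (field names may differ across torch_xla versions):

```python
import torch_xla.core.xla_model as xm


def profile_peak_memory(run_profiling_step, device):
    """Run one profiling step, then read the peak device memory.

    Assumes the pinned torch_xla nightly exposes ``peak_bytes_used``
    and ``bytes_limit`` in ``xm.get_memory_info``.
    """
    run_profiling_step()
    xm.wait_device_ops()  # wait until all queued device work finishes
    mem_info = xm.get_memory_info(device)
    return mem_info["peak_bytes_used"], mem_info["bytes_limit"]
```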
This PR changes the scheduler and model runner so that the model runner gets the input token IDs from the scheduler. This change is especially useful when the token IDs...
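To make the data flow concrete, here is a hedged sketch with hypothetical names (the real vLLM V1 structures differ) of a scheduler output that carries the token IDs to the model runner:

```python
from dataclasses import dataclass


@dataclass
class NewRequestData:
    # Hypothetical per-request payload; the real structure differs.
    req_id: str
    prompt_token_ids: list[int]


@dataclass
class SchedulerOutput:
    # The scheduler owns the request state and ships the token IDs
    # directly, so the model runner needs no separate request table.
    scheduled_new_reqs: list[NewRequestData]


def execute_model(scheduler_output: SchedulerOutput) -> None:
    for req in scheduler_output.scheduled_new_reqs:
        # Token IDs come straight from the scheduler output.
        print(f"run {req.req_id} with tokens {req.prompt_token_ids}")
```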
### Anything you want to discuss about vllm. To switch the engine from V0 to V1, we need to comprehensively support the sampling parameters in https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py While most of the...
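For reference, a quick example of the sampling surface V1 has to cover; these parameters come from vLLM's public `SamplingParams` API:

```python
from vllm import SamplingParams

# A configuration exercising several of the knobs V1 must support.
params = SamplingParams(
    n=2,                     # number of output sequences per prompt
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    presence_penalty=0.5,
    frequency_penalty=0.2,
    max_tokens=128,
    stop=["</s>"],
    logprobs=5,              # return top-5 logprobs per output token
)
```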
If I understand correctly, we should cache the intermediate tensors and reuse them for CUDA graphs. This could be another reason why the current PP implementation is not working correctly. cc @comaniac...
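A minimal standalone sketch of why the intermediate tensors must be cached: CUDA graph replay reuses fixed memory addresses, so any tensor produced inside the captured region (here a hypothetical two-stage model) must stay alive across replays. This follows the standard PyTorch capture pattern, not vLLM's actual PP code:

```python
import torch

stage1 = torch.nn.Linear(1024, 1024).cuda()
stage2 = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.zeros(8, 1024, device="cuda")

# Warm up on a side stream before capture, as CUDA graphs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_output = stage2(stage1(static_input))
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    # The intermediate tensor captured here lives at a fixed address;
    # it must be cached (kept referenced), not recreated per step.
    static_intermediate = stage1(static_input)
    static_output = stage2(static_intermediate)

# Replay: copy fresh data into the static input buffer, then replay.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
result = static_output.clone()
```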
The KV cache manager in V1 ignores the sliding window (it keeps the full KV cache rather than evicting out-of-window blocks), so prefix caching is compatible with sliding-window attention.
This PR optimizes the N-gram matching algorithm by JIT-compiling it with Numba. I've observed a 20-30x speedup with large batch sizes: for the ShareGPT benchmark with 5K requests, the cumulative overhead...
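A hedged sketch of the technique (not necessarily this PR's exact algorithm): an n-gram draft proposer written with NumPy loops and JIT-compiled by Numba, which removes the Python interpreter overhead that dominates at large batch sizes:

```python
import numpy as np
from numba import njit


@njit(cache=True)
def ngram_propose(token_ids: np.ndarray, n: int, k: int) -> np.ndarray:
    """Find the most recent earlier occurrence of the trailing n-gram
    and propose up to `k` tokens that followed it as draft tokens."""
    total = token_ids.shape[0]
    if total < n + 1:
        return token_ids[:0].copy()  # empty: nothing to match
    # Scan backwards so the most recent match wins.
    for start in range(total - n - 1, -1, -1):
        match = True
        for j in range(n):
            if token_ids[start + j] != token_ids[total - n + j]:
                match = False
                break
        if match:
            end = min(start + n + k, total)
            return token_ids[start + n:end].copy()
    return token_ids[:0].copy()


# The first call triggers compilation; later calls run as native code.
tokens = np.array([1, 2, 3, 4, 1, 2, 3], dtype=np.int64)
print(ngram_propose(tokens, n=3, k=2))  # -> [4 1]
```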