Kuntai Du
### Motivation
There are more and more use cases where we need to transfer KV caches between vLLM instances, or store KV caches for future use. Some concrete use cases: ...
This is a follow-up PR for #5557. Goal: implement disaggregated prefilling by launching 2 vLLM instances (one for prefilling, one for decoding) and forwarding the KV cache from the prefilling...
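For a rough illustration of the intended setup, here is a minimal sketch of the prefill-side ("KV producer") instance. It assumes the `KVTransferConfig` interface from vLLM's later disaggregated-prefill examples; the connector name, field values, and model are assumptions for illustration, not text from this PR.

```python
# Hypothetical sketch of the prefill ("KV producer") instance in a
# disaggregated-prefill pair; a second instance launched with
# kv_role="kv_consumer" and kv_rank=1 would receive the forwarded caches.
# The KVTransferConfig fields follow vLLM's example scripts and may
# differ across versions.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

ktc = KVTransferConfig(
    kv_connector="PyNcclConnector",  # NCCL-based KV-cache transfer
    kv_role="kv_producer",           # this instance produces KV caches
    kv_rank=0,                       # rank 0 = prefill, rank 1 = decode
    kv_parallel_size=2,              # one prefill + one decode instance
)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=ktc,
    gpu_memory_utilization=0.8,
)

# max_tokens=1: the prefill instance only needs to build the KV cache;
# the decode instance continues generation from the transferred cache.
llm.generate(["San Francisco is a"],
             SamplingParams(temperature=0, max_tokens=1))
```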
Following PR #5073, this PR aims to compare `vllm` with alternatives (like TGI, TensorRT-LLM, and LMDeploy; feel free to comment if you feel there are other alternatives we...
### System Info
I am working on the benchmarking suite in the vLLM team, and am now trying to run TensorRT-LLM for comparison. I am relying on this GitHub repo (https://github.com/neuralmagic/tensorrt-demo) to...
### System Info
Docker image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
Device: 8x H100
trt-llm backend: v0.11.0

### Who can help?
@byshiue @schetlur-nv

### Information
- [ ] The official example scripts
- [X] My...
### Proposal to improve performance
_No response_

### Report of performance regression
_No response_

### Misc discussion on performance
To reproduce vLLM's performance benchmark, please launch a shell in the...
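For concreteness, here is a sketch of one way to drive the benchmark once inside such a shell, using the `benchmarks/benchmark_serving.py` client from the vLLM source tree. The model name, dataset file, and request rate are placeholders, not values from this issue.

```python
# Sketch: run the serving benchmark against an already-running vLLM
# server (e.g. started separately with `vllm serve <model>`).
# Assumes the ShareGPT dataset file has been downloaded locally.
import subprocess

subprocess.run(
    [
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", "meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder
        "--dataset-name", "sharegpt",
        "--dataset-path", "ShareGPT_V3_unfiltered_cleaned_split.json",
        "--request-rate", "4",  # requests per second; placeholder
    ],
    check=True,
)
```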
Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang
This PR deprecates block manager v1 and makes block manager v2 the default to simplify the code path. This is supported by this [benchmark](https://docs.google.com/document/d/1XxYUFai07ta5rE7OdtCVhLJ5J0oAxEqrGgarFdjv0Zc/edit?usp=sharing), where block manager v2 is 500...
TL;DR: implemented disaggregated prefill with...