Woosuk Kwon
### Motivation.

### Overview

As we transition to vLLM V1, we plan to discontinue support for the `best_of` sampling parameter. This decision is driven by a combination of low usage, ...
### Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

### 🐛 Describe the bug

When using Ray as the distributed executor backend ...
Pipeline parallelism in V1 requires `ray[adag]` instead of `ray[default]`. Also, because of API changes in Ray 2.42.0, we have to pin the version to `2.41.0` (or `2.40.0`).
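Concretely, an install that satisfies these constraints might look like the following (an illustrative command reflecting the pins above; the exact extras and version may change as V1 evolves):

```text
pip install "ray[adag]==2.41.0"
```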
### Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

### 🐛 Describe the bug

Got the following error message when using `tp_size=4`: ...
### 🚀 The feature, motivation and pitch

Currently, the V1 rejection sampler only supports greedy sampling. We need to extend it to support random sampling. I think we can do this ...
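For reference, random sampling uses the standard acceptance rule from speculative decoding: accept a draft token `x` with probability `min(1, p_target(x) / p_draft(x))`, and on rejection resample from the normalized residual distribution. Below is a minimal NumPy sketch of that textbook rule, not vLLM's actual implementation; all names are illustrative:

```python
import numpy as np

def rejection_sample(draft_token: int,
                     p_draft: np.ndarray,
                     p_target: np.ndarray,
                     rng: np.random.Generator) -> tuple[int, bool]:
    """Accept or reject one draft token (textbook rule, illustrative only).

    p_draft and p_target are full vocabulary distributions (each sums to 1).
    Returns (token, accepted).
    """
    # Accept with probability min(1, p_target[x] / p_draft[x]).
    accept_prob = min(1.0, p_target[draft_token] / p_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the normalized residual max(0, p_target - p_draft).
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False
```

Greedy sampling is the special case where both distributions are effectively one-hot, so the rule degenerates to an exact-match check against the target argmax; random sampling requires the full probability comparison above.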
### 🚀 The feature, motivation and pitch

The current V1 rejection sampler is not well optimized and incurs unnecessary overhead. In my benchmarks, it accounts for 10-25% of the overall running time. ...
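One plausible direction (my assumption, not necessarily the approach taken in vLLM) is to vectorize the accept/reject decisions across all draft tokens in the batch instead of looping token by token in Python. A rough PyTorch sketch:

```python
import torch

def batched_accept_mask(draft_tokens: torch.Tensor,   # [num_tokens]
                        p_draft: torch.Tensor,        # [num_tokens, vocab]
                        p_target: torch.Tensor        # [num_tokens, vocab]
                        ) -> torch.Tensor:
    """Compute accept/reject decisions for all draft tokens in one shot.

    Illustrative sketch of a vectorized acceptance check; per-sequence
    handling (stopping at the first rejection) is omitted.
    """
    idx = draft_tokens.unsqueeze(-1)                   # [num_tokens, 1]
    q = p_draft.gather(-1, idx).squeeze(-1)            # draft prob of each token
    p = p_target.gather(-1, idx).squeeze(-1)           # target prob of each token
    accept_prob = torch.clamp(p / q, max=1.0)
    return torch.rand_like(accept_prob) < accept_prob  # Boolean accept mask
```

Replacing per-token control flow with a single batched computation like this avoids repeated kernel launches, which is typically where such sampler overhead comes from.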
### 🚀 The feature, motivation and pitch

DeepSeek MTP should be ported to the new V1 architecture.

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a ...
# Progress

- [x] Implement TPU executor that works on a single TPU chip (without tensor parallelism) #5292
- [x] Support single-host tensor parallel inference #5871
- [x] Support multi-host ...
### Motivation.

For code cleanup, we plan to drop support for the prompt adapter feature. Please let us know if you are using this feature.

### Proposed Change.

Dropping the prompt ...