Woosuk Kwon
### Motivation.

### Overview

As we transition to vLLM V1, we plan to discontinue support for the `best_of` sampling parameter. This decision is driven by a combination of low usage, ...
### Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

### 🐛 Describe the bug

When using Ray as the distributed executor backend ...
Pipeline parallelism in V1 requires `ray[adag]` instead of `ray[default]`. Also, because of API changes in Ray 2.42.0, we have to pin the version to `2.41.0` (or `2.40.0`).
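Concretely, an install that satisfies these constraints might look like the following (an illustrative command reflecting the pins above; the exact extras and version may change as V1 evolves):

```text
pip install "ray[adag]==2.41.0"
```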
### Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

### 🐛 Describe the bug

Got the following error message when using `tp_size=4`: ...
### 🚀 The feature, motivation and pitch

Currently, the V1 rejection sampler only supports greedy sampling. We need to extend it to support random sampling. I think we can do this ...
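For reference, random sampling uses the standard acceptance rule from speculative decoding: accept a draft token `x` with probability `min(1, p_target(x) / p_draft(x))`, and on rejection resample from the normalized residual distribution. Below is a minimal NumPy sketch of that textbook rule, not vLLM's actual implementation; all names are illustrative:

```python
import numpy as np

def rejection_sample(draft_token: int,
                     p_draft: np.ndarray,
                     p_target: np.ndarray,
                     rng: np.random.Generator) -> tuple[int, bool]:
    """Accept or reject one draft token (textbook rule, illustrative only).

    p_draft and p_target are full vocabulary distributions (each sums to 1).
    Returns (token, accepted).
    """
    # Accept with probability min(1, p_target[x] / p_draft[x]).
    accept_prob = min(1.0, p_target[draft_token] / p_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the normalized residual max(0, p_target - p_draft).
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False
```

Greedy sampling is the special case where both distributions are effectively one-hot, so the rule degenerates to an exact-match check against the target argmax; random sampling requires the full probability comparison above.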
### 🚀 The feature, motivation and pitch

The current V1 rejection sampler is not well optimized and incurs unnecessary overhead. In my benchmarks, it accounts for 10-25% of the overall running time. ...
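One plausible direction (my assumption, not necessarily the approach taken in vLLM) is to vectorize the accept/reject decisions across all draft tokens in the batch instead of looping token by token in Python. A rough PyTorch sketch:

```python
import torch

def batched_accept_mask(draft_tokens: torch.Tensor,   # [num_tokens]
                        p_draft: torch.Tensor,        # [num_tokens, vocab]
                        p_target: torch.Tensor        # [num_tokens, vocab]
                        ) -> torch.Tensor:
    """Compute accept/reject decisions for all draft tokens in one shot.

    Illustrative sketch of a vectorized acceptance check; per-sequence
    handling (stopping at the first rejection) is omitted.
    """
    idx = draft_tokens.unsqueeze(-1)                   # [num_tokens, 1]
    q = p_draft.gather(-1, idx).squeeze(-1)            # draft prob of each token
    p = p_target.gather(-1, idx).squeeze(-1)           # target prob of each token
    accept_prob = torch.clamp(p / q, max=1.0)
    return torch.rand_like(accept_prob) < accept_prob  # Boolean accept mask
```

Replacing per-token control flow with a single batched computation like this avoids repeated kernel launches, which is typically where such sampler overhead comes from.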
### 🚀 The feature, motivation and pitch

DeepSeek MTP should be ported to the new V1 architecture.

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a ...
# Progress

- [x] Implement TPU executor that works on a single TPU chip (without tensor parallelism) #5292
- [x] Support single-host tensor parallel inference #5871
- [x] Support multi-host ...
### Motivation.

For code cleanup, we plan to drop support for the prompt adapter feature. Please let us know if you are using this feature.

### Proposed Change.

Dropping the prompt ...