Woosuk Kwon

284 comments

@afeldman-nm > If prefix caching is enabled, an initial warmup request with max_tokens=1 will be sent to the engine to fill the prefix cache. Why do we need this? V1's...
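
For context, the warmup request being discussed can be sketched with vLLM's public API. This is a minimal illustration, not the actual benchmark code; the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# Engine with prefix caching enabled (placeholder model).
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# Warmup: generating a single token forces the shared prefix's KV
# blocks to be computed and cached before the measured requests run.
warmup = SamplingParams(max_tokens=1)
llm.generate(["<shared prefix text>"], warmup)
```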

@m-harmonic > Thanks for working on this. I haven't had a chance to look into the specifics of the new implementation but also wanted to ask about n>1 behavior as...
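
For reference, n>1 sampling in vLLM is requested through SamplingParams; a minimal sketch of the case being asked about (model and values are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Request n=4 completions for one prompt; the question above concerns
# how the new implementation handles this case.
params = SamplingParams(n=4, temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
for completion in outputs[0].outputs:  # n CompletionOutputs per request
    print(completion.text)
```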

@afeldman-nm could you please check the failed tests and fix or re-run them?

cc @LiuXiaoxuanPKU This PR is ready. Could you please take a look?

@michaelfeil Thanks! Happy to see you again :) We still have some headroom for performance: #13498 Please let us know if you are interested in working on this.

Wow this is amazing! Thanks for the thorough investigation!

Sorry for the delay in review. LGTM overall, but I'd like to understand the problem in more detail. Let me get back to this within 1~2 days.

@comaniac vllm-flash-attn v2.5.9.post1 was built for PyTorch v2.3.1 and is now available on PyPI: https://pypi.org/project/vllm-flash-attn/
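
For anyone verifying their environment, a quick stdlib check of the installed versions; the expected values come from the announcement above:

```python
from importlib.metadata import version

# The v2.5.9.post1 wheel targets PyTorch v2.3.1, so in a working
# environment these two versions should match that pairing.
print(version("vllm-flash-attn"))  # expected: 2.5.9.post1
print(version("torch"))            # expected: 2.3.1
```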

@russellb @robertgshaw2-redhat could you please take a look? Unfortunately I have little background on this.

@youkaichao It seems to use cupy-cu12. However, IIUC, it doesn't break anything on our cu11.8 build unless the user explicitly chooses Ray?