Woosuk Kwon

284 comments

@afeldman-nm > If prefix caching is enabled, an initial warmup request with max_tokens=1 will be sent to the engine to fill the prefix cache. Why do we need this? V1's...
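
For context, the warmup request being discussed can be sketched with vLLM's public API. This is a minimal illustration, not the actual benchmark code; the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# Engine with prefix caching enabled (placeholder model).
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# Warmup: generating a single token forces the shared prefix's KV
# blocks to be computed and cached before the measured requests run.
warmup = SamplingParams(max_tokens=1)
llm.generate(["<shared prefix text>"], warmup)
```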

@m-harmonic > Thanks for working on this. I haven't had a chance to look into the specifics of the new implementation but also wanted to ask about n>1 behavior as...
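
For reference, n>1 sampling in vLLM is requested through SamplingParams; a minimal sketch of the case being asked about (model and values are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Request n=4 completions for one prompt; the question above concerns
# how the new implementation handles this case.
params = SamplingParams(n=4, temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
for completion in outputs[0].outputs:  # n CompletionOutputs per request
    print(completion.text)
```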

@afeldman-nm could you please check the failed tests and fix or re-run them?

cc @LiuXiaoxuanPKU This PR is ready. Could you please take a look?

@michaelfeil Thanks! Happy to see you again :) We still have some headroom for performance: #13498 Please let us know if you are interested in working on this.

Wow this is amazing! Thanks for the thorough investigation!

Sorry for the delay in review. LGTM overall, but I'd like to understand the problem in more detail. Let me get back to this within 1~2 days.

@comaniac vllm-flash-attn v2.5.9.post1 was built for PyTorch v2.3.1 and is now available on PyPI: https://pypi.org/project/vllm-flash-attn/
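
For anyone verifying their environment, a quick stdlib check of the installed versions; the expected values come from the announcement above:

```python
from importlib.metadata import version

# The v2.5.9.post1 wheel targets PyTorch v2.3.1, so in a working
# environment these two versions should match that pairing.
print(version("vllm-flash-attn"))  # expected: 2.5.9.post1
print(version("torch"))            # expected: 2.3.1
```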

@russellb @robertgshaw2-redhat could you please take a look? Unfortunately I have little background on this.

@youkaichao It seems to use cupy-cu12. However, IIUC, it doesn't break anything on our cu11.8 build unless the user explicitly chooses Ray?