Woosuk Kwon

Results 151 comments of Woosuk Kwon

I'll keep this issue though, as we haven't got the "efficient" implementation of MQA yet.

@andreapiso Starting with the latest release, vLLM is published with pre-built CUDA binaries. Please try out `pip install vllm` and let us know if it does not work for...

@sleepcoo Awesome! Thanks for your contribution! Before I get into review, could you double-check the new kernel produces correct outputs? When I tested it out, it didn't match our reference...

Hi @sleepcoo, Is the bug fixed now? We will add the code format checker later. 🙏 Could you wrap up this PR first?

Hi @sleepcoo, thanks for submitting the PR and sorry for the delay in my review. I left some comments on the code style. BTW, could you update your PR branch...

Closing the PR since it is pretty old and we'd like to stick to the current RMSNorm implementation, which upcasts the data type to FP32 during computation.
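For context, a minimal sketch of what such an upcasting RMSNorm looks like. This is a NumPy stand-in, not vLLM's actual CUDA kernel; the function name, `eps` default, and shapes are assumptions for illustration:

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm that upcasts to FP32 for the reduction, then casts back.

    Doing the mean-of-squares in the input dtype (e.g. FP16) can lose
    precision; upcasting to FP32 keeps the variance estimate stable.
    """
    orig_dtype = x.dtype
    x32 = x.astype(np.float32)  # upcast before computing the statistic
    variance = np.mean(x32 * x32, axis=-1, keepdims=True)
    y = x32 / np.sqrt(variance + eps)
    # Apply the learned scale in FP32, then cast back to the input dtype.
    return (y * weight.astype(np.float32)).astype(orig_dtype)

x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float16)
w = np.ones(4, dtype=np.float16)
out = rms_norm(x, w)  # output stays in the caller's dtype (float16 here)
```

Mean of squares for `[1, 2, 3, 4]` is 7.5, so each element is divided by roughly 2.7386.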

Without the fix, cacheflow can suddenly hang without any notice. Let's fix this.

Hi @merrymercy, thanks for letting us know. Please feel free to contribute.

The same error occurs when creating two LLMs:
```python
LLM(model="facebook/opt-125m")
LLM(model="facebook/opt-125m")  # RuntimeError: trying to initialize the default process group twice!
```

@Joejoequ Thanks for reporting it! I think in your case, the problem can be easily solved by installing the CUDA 11.8 build of PyTorch:
```bash
pip3 install torch --index-url https://download.pytorch.org/whl/cu118
```