Woosuk Kwon

Results 151 comments of Woosuk Kwon

I'll keep this issue though, as we haven't got the "efficient" implementation of MQA yet.

@andreapiso Starting with the latest release, vLLM is published with pre-built CUDA binaries. Please try out `pip install vllm` and let us know if it does not work for...

@sleepcoo Awesome! Thanks for your contribution! Before I get into review, could you double-check the new kernel produces correct outputs? When I tested it out, it didn't match our reference...

Hi @sleepcoo, Is the bug fixed now? We will add the code format checker later. 🙏 Could you wrap up this PR first?

Hi @sleepcoo, thanks for submitting the PR and sorry for the delay in my review. I left some comments on the code style. BTW, could you update your PR branch...

Closing the PR since it is pretty old and we'd like to stick to the current RMSNorm implementation, which upcasts the data type to FP32 during computation.
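For context, a minimal sketch of what such an upcasting RMSNorm looks like. This is a NumPy stand-in, not vLLM's actual CUDA kernel; the function name, `eps` default, and shapes are assumptions for illustration:

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm that upcasts to FP32 for the reduction, then casts back.

    Doing the mean-of-squares in the input dtype (e.g. FP16) can lose
    precision; upcasting to FP32 keeps the variance estimate stable.
    """
    orig_dtype = x.dtype
    x32 = x.astype(np.float32)  # upcast before computing the statistic
    variance = np.mean(x32 * x32, axis=-1, keepdims=True)
    y = x32 / np.sqrt(variance + eps)
    # Apply the learned scale in FP32, then cast back to the input dtype.
    return (y * weight.astype(np.float32)).astype(orig_dtype)

x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float16)
w = np.ones(4, dtype=np.float16)
out = rms_norm(x, w)  # output stays in the caller's dtype (float16 here)
```

Mean of squares for `[1, 2, 3, 4]` is 7.5, so each element is divided by roughly 2.7386.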

Without the fix, cacheflow can suddenly hang without any notice. Let's fix this.

Hi @merrymercy, thanks for letting us know. Please feel free to contribute.

The same error occurs when creating two LLMs:
```python
LLM(model="facebook/opt-125m")
LLM(model="facebook/opt-125m")  # RuntimeError: trying to initialize the default process group twice!
```

@Joejoequ Thanks for reporting it! I think in your case, the problem can be easily solved by installing the CUDA 11.8 build of PyTorch:
```bash
pip3 install torch --index-url https://download.pytorch.org/whl/cu118
```