Woosuk Kwon

151 comments by Woosuk Kwon

@armsp vLLM does not support quantization at the moment. However, could you let us know the data type & quantization method you use for the models? That will definitely help...

@zhuohan123 Are you still working on this PR?

@chu-tianxiang Thanks for letting me know! I think we should focus on the 4-bit support in this PR and work on the other bit widths in the future PRs. Could...

@robertgshaw2-neuralmagic Please take a look at the PR!

Hi @bigPYJ1151 Thanks for updating the PR! It looks really nice. Just for other people's understanding, could you write an RFC about the overall design, supported features, key technical decisions,...

@bigPYJ1151 @zhouyuan QQ: Can we use `torch.compile` to auto-generate the custom C++ kernels, except for PagedAttention? This would increase the maintainability of the code a lot. I'm wondering how `torch.compile` performs...
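
To make the idea concrete, here is a rough sketch assuming a plain-PyTorch RMSNorm as a stand-in for one of the hand-written ops (illustrative only, not the code in this PR):

```python
import torch


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Plain-PyTorch RMSNorm: compute the variance in fp32 for stability,
    # normalize, then rescale by the learned weight.
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    x_normed = x.float() * torch.rsqrt(variance + eps)
    return (weight * x_normed).to(weight.dtype)


# torch.compile traces the function and emits fused code for the target
# backend (on CPU, Inductor generates C++/OpenMP kernels), so the chain of
# elementwise ops above would not need a hand-written kernel.
compiled_rms_norm = torch.compile(rms_norm)

x = torch.randn(4, 4096)
weight = torch.ones(4096)
out = compiled_rms_norm(x, weight)
```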

@pcmoritz Did you happen to try `--enforce-eager` on the main branch? I'm wondering whether this memory leak is due to CUDA graph or due to the fix in #2151.
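
For reference, a minimal sketch of the two configurations (the model name is just a placeholder): `--enforce-eager` on the CLI, or `enforce_eager=True` in the Python API, disables CUDA graph capture so the engine runs fully in eager mode.

```python
from vllm import LLM, SamplingParams

# Run without CUDA graphs to isolate whether graph capture causes the memory growth.
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```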

@DarkLight1337 @ywang96 @rkooo567 Is this PR ready for merge?

Hi @amulil, `gpu_memory_utilization` is the fraction of GPU memory you want to allocate to vLLM. vLLM will use it to store weights, allocate some workspace, and allocate the KV cache....
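
A minimal example (the model name is a placeholder): with `gpu_memory_utilization=0.9`, vLLM targets ~90% of each GPU's memory, and whatever remains after loading the weights and reserving workspace is pre-allocated as KV cache blocks.

```python
from vllm import LLM

# Let vLLM use up to ~90% of GPU memory; the leftover ~10% stays free for
# other processes and fragmentation.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)
```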

@amulil vLLM does not support `use_cache=False`. I believe there is no reason to disable KV cache because it is a pure optimization that significantly reduces the FLOPs of generation.
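
To illustrate the point outside vLLM (Hugging Face `transformers` does expose `use_cache`), the cache only changes how much compute each decode step does, not the generated text. A minimal sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello, my name is", return_tensors="pt")

# With the KV cache (default): each step reuses the cached keys/values of the
# prefix, so only the newly generated token is pushed through the model.
with_cache = model.generate(**inputs, max_new_tokens=32, use_cache=True)

# Without the cache: every step re-encodes the entire growing sequence,
# spending strictly more FLOPs to produce the same tokens.
without_cache = model.generate(**inputs, max_new_tokens=32, use_cache=False)

print(tokenizer.decode(with_cache[0]))
print(tokenizer.decode(without_cache[0]))
```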