Woosuk Kwon
@armsp vLLM does not support quantization at the moment. However, could you let us know the data type & quantization method you use for the models? That will definitely help...
@zhuohan123 Are you still working on this PR?
@chu-tianxiang Thanks for letting me know! I think we should focus on the 4-bit support in this PR and work on the other bit widths in future PRs. Could...
@robertgshaw2-neuralmagic Please take a look at the PR!
Hi @bigPYJ1151 Thanks for updating the PR! It looks really nice. Just for other people's understanding, could you write an RFC about the overall design, supported features, key technical decisions,...
@bigPYJ1151 @zhouyuan QQ: Can we use `torch.compile` to auto-generate the custom C++ kernels, except for PagedAttention? This would increase the maintainability of the code a lot. I'm wondering how `torch.compile` performs...
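Just to illustrate the idea, here is a minimal sketch of letting `torch.compile` generate a kernel in place of a hand-written one. The `RMSNorm` module below is only a stand-in for one of the custom ops, not vLLM's actual implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Illustrative stand-in for one of the custom ops (not vLLM's code)."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

# Let torch.compile (Inductor) fuse and generate the kernel instead of
# maintaining a hand-written C++ implementation.
norm = torch.compile(RMSNorm(4096))
out = norm(torch.randn(8, 4096))
```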
@pcmoritz Did you happen to try `--enforce-eager` on the main branch? I'm wondering whether this memory leak is due to CUDA graph or due to the fix in #2151.
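For anyone reproducing this: `--enforce-eager` (or `enforce_eager=True` in the Python API) disables CUDA graph capture so the model always runs in eager mode. A minimal sketch, with the model name just as an example:

```python
from vllm import LLM, SamplingParams

# enforce_eager=True skips CUDA graph capture, which isolates whether the
# leak comes from CUDA graphs or from elsewhere. Model name is an example.
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```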
@DarkLight1337 @ywang96 @rkooo567 Is this PR ready for merge?
Hi @amulil, `gpu_memory_utilization` is the fraction of GPU memory you want to allocate to vLLM. vLLM will use it to store the model weights, allocate some workspace, and allocate the KV cache...
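For example (model name and value are just placeholders):

```python
from vllm import LLM

# Give vLLM ~80% of the GPU's memory. Whatever remains after the weights
# and workspace are allocated is used for KV-cache blocks.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.8)
```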
@amulil vLLM does not support `use_cache=False`. I believe there is no reason to disable the KV cache, because it is a pure optimization that significantly reduces the FLOPs of generation.
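For context, `use_cache` is a HuggingFace `transformers` flag; a minimal sketch of what it toggles there (model name is an example). vLLM has no equivalent because it always keeps the KV cache:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
inputs = tok("Hello, my name is", return_tensors="pt")

# With the KV cache: each decoding step attends over cached keys/values.
fast = model.generate(**inputs, max_new_tokens=16, use_cache=True)

# Without it: every step recomputes attention over the full prefix,
# producing the same output with far more FLOPs.
slow = model.generate(**inputs, max_new_tokens=16, use_cache=False)
```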