Woosuk Kwon
@armsp vLLM does not support quantization at the moment. However, could you let us know the data type & quantization method you use for the models? That will definitely help...
@zhuohan123 Are you still working on this PR?
@chu-tianxiang Thanks for letting me know! I think we should focus on the 4-bit support in this PR and work on the other bit widths in future PRs. Could...
@robertgshaw2-neuralmagic Please take a look at the PR!
Hi @bigPYJ1151 Thanks for updating the PR! It looks really nice. Just for other people's understanding, could you write an RFC about the overall design, supported features, key technical decisions,...
@bigPYJ1151 @zhouyuan QQ: Can we use `torch.compile` to auto-generate the custom C++ kernels, except for PagedAttention? This would increase the maintainability of the code a lot. I'm wondering how `torch.compile` performs...
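Just to illustrate the idea, here is a minimal sketch of letting `torch.compile` generate a kernel in place of a hand-written one. The `RMSNorm` module below is only a stand-in for one of the custom ops, not vLLM's actual implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Illustrative stand-in for one of the custom ops (not vLLM's code)."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

# Let torch.compile (Inductor) fuse and generate the kernel instead of
# maintaining a hand-written C++ implementation.
norm = torch.compile(RMSNorm(4096))
out = norm(torch.randn(8, 4096))
```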
@pcmoritz Did you happen to try `--enforce-eager` on the main branch? I'm wondering whether this memory leak is due to CUDA graph or due to the fix in #2151.
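For anyone reproducing this: `--enforce-eager` (or `enforce_eager=True` in the Python API) disables CUDA graph capture so the model always runs in eager mode. A minimal sketch, with the model name just as an example:

```python
from vllm import LLM, SamplingParams

# enforce_eager=True skips CUDA graph capture, which isolates whether the
# leak comes from CUDA graphs or from elsewhere. Model name is an example.
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```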
@DarkLight1337 @ywang96 @rkooo567 Is this PR ready for merge?
Hi @amulil, `gpu_memory_utilization` is the fraction of GPU memory you want to allocate to vLLM. vLLM will use it to store the model weights, allocate some workspace, and allocate the KV cache...
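For example (model name and value are just placeholders):

```python
from vllm import LLM

# Give vLLM ~80% of the GPU's memory. Whatever remains after the weights
# and workspace are allocated is used for KV-cache blocks.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.8)
```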
@amulil vLLM does not support `use_cache=False`. I believe there is no reason to disable the KV cache, because it is a pure optimization that significantly reduces the FLOPs of generation.
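For context, `use_cache` is a HuggingFace `transformers` flag; a minimal sketch of what it toggles there (model name is an example). vLLM has no equivalent because it always keeps the KV cache:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
inputs = tok("Hello, my name is", return_tensors="pt")

# With the KV cache: each decoding step attends over cached keys/values.
fast = model.generate(**inputs, max_new_tokens=16, use_cache=True)

# Without it: every step recomputes attention over the full prefix,
# producing the same output with far more FLOPs.
slow = model.generate(**inputs, max_new_tokens=16, use_cache=False)
```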