Woosuk Kwon
@lucasjinreal Is your model different from the original LLaMA? If not, you can simply pass the path to your model weights via `llm = LLM(model=<path_to_your_weights>)` and use the `llm` object...
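For reference, a minimal sketch (the weight path below is just a placeholder for your local checkpoint directory):

```python
from vllm import LLM, SamplingParams

# "/path/to/your/llama-weights" is a placeholder for a local directory
# containing HF-format LLaMA weights and tokenizer files.
llm = LLM(model="/path/to/your/llama-weights")

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```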
@liujuncn Thanks for your feedback. We'll add more details to the docs. In the meantime, to address your issue quickly, could you share with us the specific model you're interested in...
@dcruiz01 @SunixLiu @AlpinDale vLLM is designed to take up almost all of your GPU memory. Could you double-check that your GPU is not being used by other processes when running vLLM?
@AlpinDale Good question. You can use the `tensor_parallel_size` argument for multi-GPU inference. First, initialize your Ray cluster by executing:

```bash
$ ray start --head
```

Then, use the `tensor_parallel_size` argument...
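A minimal sketch of what the multi-GPU setup could look like (the model name here is only an example):

```python
from vllm import LLM, SamplingParams

# After `ray start --head`, shard the model across 4 GPUs with tensor parallelism.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)

outputs = llm.generate(["The future of AI is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```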
@wangkuiyi @ruidongtd @createmomo @nuass @wengrx @rossbucky @bsabri We've just added BLOOM. You can immediately use it by [installing vLLM from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source).
Hi @Hukongtao, thanks for trying out vLLM! The memory usage is high because vLLM pre-allocates space to store the KV cache. You can configure the memory usage by tuning the...
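For example, something along these lines should shrink the pre-allocated pool (the model name is just a placeholder):

```python
from vllm import LLM

# By default vLLM reserves most of the GPU memory for weights plus the KV cache.
# Lowering gpu_memory_utilization reduces the fraction of GPU memory vLLM pre-allocates.
llm = LLM(model="facebook/opt-6.7b", gpu_memory_utilization=0.5)
```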
Thanks for the feature request! Quantization is not currently supported, but it's definitely on our roadmap. Please stay tuned.
Hi @Matthieu-Tinycoaching, thanks for bringing it up! As mentioned in #187, T5 support is definitely on our roadmap. The current blocker is its encoder-decoder architecture, which vLLM's current implementation does...
Thanks for your interest! PagedAttention is more like an implementation of an attention algorithm. Thus, it is also applicable to MQA and can avoid a lot of memory waste. We...
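To give a rough sense of why this matters for MQA, here's a back-of-the-envelope calculation with illustrative numbers (not vLLM internals):

```python
# KV-cache size per token for a LLaMA-7B-like config in fp16 (illustrative only).
num_layers, head_dim, dtype_bytes = 32, 128, 2

def kv_bytes_per_token(num_kv_heads: int) -> int:
    # Factor of 2 accounts for storing both the key and the value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

print(kv_bytes_per_token(num_kv_heads=32))  # MHA: 32 KV heads -> 524288 bytes (512 KiB)
print(kv_bytes_per_token(num_kv_heads=1))   # MQA: 1 shared KV head -> 16384 bytes (16 KiB)
```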
@xpl vLLM now supports StarCoder thanks to @michaelfeil. Please try it out!
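A quick way to try it (assuming you have access to the `bigcode/starcoder` weights on Hugging Face):

```python
from vllm import LLM, SamplingParams

# "bigcode/starcoder" is the Hugging Face model ID for StarCoder.
llm = LLM(model="bigcode/starcoder")
outputs = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```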