Woosuk Kwon
@lucasjinreal Is your model different from the original LLaMA? If not, you can simply pass the path to your model weights via `llm = LLM(model=<path_to_your_weights>)` and use the `llm` object...
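For reference, a minimal sketch (the weight path below is just a placeholder for your local checkpoint directory):

```python
from vllm import LLM, SamplingParams

# "/path/to/your/llama-weights" is a placeholder for a local directory
# containing HF-format LLaMA weights and tokenizer files.
llm = LLM(model="/path/to/your/llama-weights")

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```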
@liujuncn Thanks for your feedback. We'll add more details to the docs. In the meantime, to address your issue quickly, could you share with us the specific model you're interested in...
@dcruiz01 @SunixLiu @AlpinDale vLLM is designed to take up almost all of your GPU memory. Could you double-check that your GPU is not being used by other processes when running vLLM?
@AlpinDale Good question. You can use the `tensor_parallel_size` argument for multi-GPU inference. First, initialize your Ray cluster by executing:

```bash
$ ray start --head
```

Then, use the `tensor_parallel_size` argument...
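A minimal sketch of what the multi-GPU setup could look like (the model name here is only an example):

```python
from vllm import LLM, SamplingParams

# After `ray start --head`, shard the model across 4 GPUs with tensor parallelism.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)

outputs = llm.generate(["The future of AI is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```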
@wangkuiyi @ruidongtd @createmomo @nuass @wengrx @rossbucky @bsabri We've just added BLOOM. You can immediately use it by [installing vLLM from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source).
Hi @Hukongtao, thanks for trying out vLLM! The memory usage is high because vLLM pre-allocates space to store the KV cache. You can configure the memory usage by tuning the...
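For example, something along these lines should shrink the pre-allocated pool (the model name is just a placeholder):

```python
from vllm import LLM

# By default vLLM reserves most of the GPU memory for weights plus the KV cache.
# Lowering gpu_memory_utilization reduces the fraction of GPU memory vLLM pre-allocates.
llm = LLM(model="facebook/opt-6.7b", gpu_memory_utilization=0.5)
```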
Thanks for the feature request! Quantization is not currently supported, but it's definitely on our roadmap. Please stay tuned.
Hi @Matthieu-Tinycoaching, thanks for bringing it up! As mentioned in #187, T5 support is definitely on our roadmap. The current blocker is its encoder-decoder architecture, which vLLM's current implementation does...
Thanks for your interest! PagedAttention is more like an implementation of an attention algorithm. Thus, it is also applicable to MQA and can avoid a lot of memory waste. We...
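To give a rough sense of why this matters for MQA, here's a back-of-the-envelope calculation with illustrative numbers (not vLLM internals):

```python
# KV-cache size per token for a LLaMA-7B-like config in fp16 (illustrative only).
num_layers, head_dim, dtype_bytes = 32, 128, 2

def kv_bytes_per_token(num_kv_heads: int) -> int:
    # Factor of 2 accounts for storing both the key and the value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

print(kv_bytes_per_token(num_kv_heads=32))  # MHA: 32 KV heads -> 524288 bytes (512 KiB)
print(kv_bytes_per_token(num_kv_heads=1))   # MQA: 1 shared KV head -> 16384 bytes (16 KiB)
```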
@xpl vLLM now supports StarCoder thanks to @michaelfeil. Please try it out!
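A quick way to try it (assuming you have access to the `bigcode/starcoder` weights on Hugging Face):

```python
from vllm import LLM, SamplingParams

# "bigcode/starcoder" is the Hugging Face model ID for StarCoder.
llm = LLM(model="bigcode/starcoder")
outputs = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```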