
A high-throughput and memory-efficient inference and serving engine for LLMs

Results: 2816 vllm issues, sorted by recently updated

I got this message when trying out vllm on Windows: `No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin'`. CUDA is installed and available in that directory. Does...

help wanted
Installation

Really impressive results 👏 Any plans to support a CPU-only mode? That way it could be used on commodity laptops such as the MacBook Pro.

I noticed that we use conditions like this to check whether greedy sampling is used: https://github.com/WoosukKwon/cacheflow/blob/189ae231336857bcc4c6f6157bf7868cdf56fb5f/cacheflow/sampling_params.py#L45. However, I suspect this will cause several problems: 1. It is not recommended...

good first issue
P1
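A common pitfall with such checks is exact floating-point comparison on `temperature`. A minimal sketch of an epsilon-based alternative, assuming that is the concern (hypothetical names and tolerance, not the actual cacheflow code):

```python
# Hypothetical sketch: an epsilon-based greedy-sampling check, assuming the
# issue is that an exact comparison like `temperature == 0.0` is fragile.
_SAMPLING_EPS = 1e-5  # assumed tolerance, not taken from the repository


class SamplingParams:
    def __init__(self, temperature: float = 1.0, top_p: float = 1.0, top_k: int = -1):
        self.temperature = temperature
        self.top_p = top_p
        self.top_k = top_k

    @property
    def is_greedy(self) -> bool:
        # Treat any temperature close enough to zero as greedy sampling,
        # instead of requiring the caller to pass exactly 0.0.
        return self.temperature < _SAMPLING_EPS
```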

As mentioned in https://github.com/WoosukKwon/cacheflow/pull/81#issuecomment-1546980281, the current PyTorch-based top-k and top-p implementation is memory-inefficient. This can be improved by introducing custom kernels.

help wanted
performance
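For context, a typical pure-PyTorch top-p/top-k filter looks roughly like the sketch below (an illustration, not the actual cacheflow code). Each sort/cumsum/gather/scatter step materializes another vocabulary-sized tensor, which is the memory overhead a fused custom kernel could avoid:

```python
import torch


def apply_top_p_top_k(logits: torch.Tensor, top_p: float, top_k: int) -> torch.Tensor:
    """Illustrative sketch of top-p/top-k filtering on [batch, vocab] logits."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)  # full-size copy
    cum_probs = sorted_probs.cumsum(dim=-1)                         # another copy

    # Mask tokens outside the nucleus (top-p) and beyond the top-k cutoff.
    mask = cum_probs - sorted_probs > top_p
    if top_k > 0:
        mask[..., top_k:] = True

    sorted_logits = logits.gather(-1, sorted_idx)
    sorted_logits[mask] = float("-inf")
    # Scatter the filtered logits back to their original vocabulary positions.
    return torch.empty_like(logits).scatter_(-1, sorted_idx, sorted_logits)
```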

Currently we call `torch.distributed.init_process_group` even for a single GPU. This is redundant and causes errors when the LLM object is created multiple times.

bug
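A sketch of the guard this suggests (a hypothetical helper, not the actual vLLM code): skip process-group setup on a single GPU, and avoid re-initializing when the LLM object is created more than once, assuming the default `env://` rendezvous:

```python
import torch.distributed as dist


def init_distributed(world_size: int, rank: int = 0, backend: str = "nccl") -> None:
    """Initialize torch.distributed only when it is actually needed."""
    if world_size == 1:
        return  # nothing to synchronize on a single GPU
    if dist.is_initialized():
        return  # a previous LLM instance already set up the process group
    dist.init_process_group(backend=backend, world_size=world_size, rank=rank)
```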

I failed to build the system with the latest NVIDIA PyTorch Docker image. The reason is that the PyTorch installed by `pip` is built with CUDA 11.7, while the container uses CUDA...

Installation
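A quick way to spot this kind of mismatch before building (a small diagnostic sketch, not part of the project) is to compare the CUDA version the pip-installed wheel was compiled against with the toolkit inside the container:

```python
import subprocess

import torch

# CUDA version the PyTorch wheel was built against.
print("torch built with CUDA:", torch.version.cuda)

# CUDA toolkit version available in the container (nvcc).
out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
print("container nvcc:", next(line for line in out.splitlines() if "release" in line))
```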

We need tests for the models we support. The tests should ensure that the outputs of our models when using greedy sampling are equivalent to those of HF models.

P1
test
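A sketch of what such a test could look like, assuming the engine exposes `LLM` and `SamplingParams` as in vLLM's public Python API and using `facebook/opt-125m` as a stand-in model:

```python
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical test sketch: assumes the public `LLM` / `SamplingParams` API.
from vllm import LLM, SamplingParams

MODEL = "facebook/opt-125m"  # assumed small model for speed; any supported model would do
PROMPTS = ["Hello, my name is", "The capital of France is"]
MAX_NEW_TOKENS = 32


def hf_greedy(prompt: str) -> str:
    """Reference completion from Hugging Face transformers using greedy decoding."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=MAX_NEW_TOKENS)
    new_ids = output_ids[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


@pytest.mark.parametrize("prompt", PROMPTS)
def test_greedy_matches_hf(prompt: str) -> None:
    llm = LLM(model=MODEL)
    params = SamplingParams(temperature=0.0, max_tokens=MAX_NEW_TOKENS)  # temperature 0 => greedy
    engine_text = llm.generate([prompt], params)[0].outputs[0].text
    assert engine_text == hf_greedy(prompt)
```

Exact string equality may be too strict in practice because of floating-point differences between the two implementations; comparing generated token IDs or tolerating a small divergence after many tokens are possible relaxations.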