vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
I got this message when trying out vllm on Windows: `No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin'` CUDA is installed and available in that directory. Does...
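A quick way to narrow this kind of report down is to check whether the installed PyTorch build itself ships with CUDA support, since "No CUDA runtime is found" often points at a CPU-only `torch` wheel rather than a missing toolkit. A minimal diagnostic sketch (not part of vLLM):

```python
import torch

print("torch version:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)      # None for CPU-only wheels
print("CUDA available at runtime:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```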
Really impressive results 👏 Any plans to support a CPU-only mode? That way it could be used on commodity laptops such as the MacBook Pro.
I noticed that we use conditions like this to check whether sampling is greedy: https://github.com/WoosukKwon/cacheflow/blob/189ae231336857bcc4c6f6157bf7868cdf56fb5f/cacheflow/sampling_params.py#L45 However, I think this will cause several problems: 1. It is not recommended...
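One possible alternative to repeating an epsilon comparison at every call site is to decide the sampling type once when the parameters are constructed. The `SamplingType` enum and `SamplingParams` fields below are an illustrative sketch, not the project's actual API:

```python
import enum

_SAMPLING_EPS = 1e-5

class SamplingType(enum.Enum):
    GREEDY = enum.auto()
    RANDOM = enum.auto()

class SamplingParams:
    def __init__(self, temperature: float = 1.0, top_p: float = 1.0, top_k: int = -1):
        self.temperature = temperature
        self.top_p = top_p
        self.top_k = top_k
        # Decide once whether this request is greedy, instead of
        # re-checking `temperature < eps` throughout the codebase.
        self.sampling_type = (
            SamplingType.GREEDY if temperature < _SAMPLING_EPS else SamplingType.RANDOM
        )
```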
As mentioned in https://github.com/WoosukKwon/cacheflow/pull/81#issuecomment-1546980281, the current PyTorch-based top-k and top-p implementation is memory-inefficient. This can be improved by introducing custom kernels.
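For context, the PyTorch-only pattern being referred to typically looks like the sketch below: the sort, softmax, cumulative sum, and boolean masks each materialize a full `[batch, vocab]` tensor, which is the memory overhead a fused custom kernel would avoid. This is a generic illustration, not the engine's actual implementation:

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    # Sort logits so top-k and top-p masks can be built on contiguous prefixes.
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)

    # Top-k: mask everything beyond the k-th largest logit.
    if top_k > 0:
        k_mask = torch.arange(logits.size(-1), device=logits.device) >= top_k
        sorted_logits = sorted_logits.masked_fill(k_mask, float("-inf"))

    # Top-p: mask tokens once the cumulative probability exceeds p.
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(probs, dim=-1)
    p_mask = cum_probs - probs > top_p   # always keeps at least the first token
    sorted_logits = sorted_logits.masked_fill(p_mask, float("-inf"))

    # Scatter the filtered logits back to the original token order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
```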
Currently we call `torch.distributed.init_process_group` even for a single GPU. This is redundant and causes errors when the LLM object is created multiple times.
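A hedged sketch of one way to avoid the redundant call: only set up `torch.distributed` when there is more than one worker, and skip re-initialization if a process group already exists. The function name and arguments are illustrative, not the project's actual initialization code:

```python
import torch.distributed as dist

def maybe_init_distributed(world_size: int, rank: int, init_method: str) -> None:
    if world_size == 1:
        # Single-GPU case: no process group is needed at all.
        return
    if dist.is_initialized():
        # Creating a second LLM object in the same process should not
        # attempt to initialize the group a second time.
        return
    dist.init_process_group(
        backend="nccl",
        init_method=init_method,
        world_size=world_size,
        rank=rank,
    )
```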
I failed to build the system with the latest NVIDIA PyTorch Docker image. The reason is that the PyTorch installed by `pip` is built with CUDA 11.7, while the container uses CUDA...
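A small check to confirm this kind of mismatch inside the container is to compare the CUDA version PyTorch was compiled against with what the toolkit's `nvcc` reports:

```python
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```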
We need tests for the models we support. The tests should ensure that the outputs of our models when using greedy sampling are equivalent to those of HF models.
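A minimal sketch of such a test, assuming an `LLM` class with a `generate()` method that returns the generated text; the exact engine API, model list, and prompts here are placeholders and may differ from the real code:

```python
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = ["facebook/opt-125m"]  # illustrative; extend to all supported models
PROMPTS = ["Hello, my name is", "The capital of France is"]
MAX_NEW_TOKENS = 32

@pytest.mark.parametrize("model_name", MODELS)
def test_greedy_matches_hf(model_name):
    # Reference: greedy decoding with Hugging Face Transformers.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    hf_model = AutoModelForCausalLM.from_pretrained(model_name)
    hf_outputs = []
    for prompt in PROMPTS:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        output_ids = hf_model.generate(
            input_ids, max_new_tokens=MAX_NEW_TOKENS, do_sample=False
        )
        hf_outputs.append(
            tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
        )

    # System under test: greedy sampling (temperature=0) in the engine.
    from vllm import LLM, SamplingParams  # assumed import path
    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0.0, max_tokens=MAX_NEW_TOKENS)
    engine_outputs = [out.outputs[0].text for out in llm.generate(PROMPTS, params)]

    assert engine_outputs == hf_outputs
```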