vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
I got this message when trying out vllm on Windows: `No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin'` CUDA is installed and available in that directory. Does...
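A quick way to narrow this kind of report down is to check whether the installed PyTorch build itself ships with CUDA support, since "No CUDA runtime is found" often points at a CPU-only `torch` wheel rather than a missing toolkit. A minimal diagnostic sketch (not part of vLLM):

```python
import torch

print("torch version:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)      # None for CPU-only wheels
print("CUDA available at runtime:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```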
Really impressive results 👏 Any plans to support a CPU-only mode? That way it could be used on commodity laptops such as the MacBook Pro.
I noticed that we use conditions like this to check whether sampling is greedy: https://github.com/WoosukKwon/cacheflow/blob/189ae231336857bcc4c6f6157bf7868cdf56fb5f/cacheflow/sampling_params.py#L45 However, I think this will cause several problems: 1. It is not recommended...
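One possible alternative to repeating an epsilon comparison at every call site is to decide the sampling type once when the parameters are constructed. The `SamplingType` enum and `SamplingParams` fields below are an illustrative sketch, not the project's actual API:

```python
import enum

_SAMPLING_EPS = 1e-5

class SamplingType(enum.Enum):
    GREEDY = enum.auto()
    RANDOM = enum.auto()

class SamplingParams:
    def __init__(self, temperature: float = 1.0, top_p: float = 1.0, top_k: int = -1):
        self.temperature = temperature
        self.top_p = top_p
        self.top_k = top_k
        # Decide once whether this request is greedy, instead of
        # re-checking `temperature < eps` throughout the codebase.
        self.sampling_type = (
            SamplingType.GREEDY if temperature < _SAMPLING_EPS else SamplingType.RANDOM
        )
```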
As mentioned in https://github.com/WoosukKwon/cacheflow/pull/81#issuecomment-1546980281, the current PyTorch-based top-k and top-p implementation is memory-inefficient. This can be improved by introducing custom kernels.
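For context, the PyTorch-only pattern being referred to typically looks like the sketch below: the sort, softmax, cumulative sum, and boolean masks each materialize a full `[batch, vocab]` tensor, which is the memory overhead a fused custom kernel would avoid. This is a generic illustration, not the engine's actual implementation:

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    # Sort logits so top-k and top-p masks can be built on contiguous prefixes.
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)

    # Top-k: mask everything beyond the k-th largest logit.
    if top_k > 0:
        k_mask = torch.arange(logits.size(-1), device=logits.device) >= top_k
        sorted_logits = sorted_logits.masked_fill(k_mask, float("-inf"))

    # Top-p: mask tokens once the cumulative probability exceeds p.
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(probs, dim=-1)
    p_mask = cum_probs - probs > top_p   # always keeps at least the first token
    sorted_logits = sorted_logits.masked_fill(p_mask, float("-inf"))

    # Scatter the filtered logits back to the original token order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
```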
Currently we call `torch.distributed.init_process_group` even for a single GPU. This is redundant and causes errors when the LLM object is created multiple times.
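A hedged sketch of one way to avoid the redundant call: only set up `torch.distributed` when there is more than one worker, and skip re-initialization if a process group already exists. The function name and arguments are illustrative, not the project's actual initialization code:

```python
import torch.distributed as dist

def maybe_init_distributed(world_size: int, rank: int, init_method: str) -> None:
    if world_size == 1:
        # Single-GPU case: no process group is needed at all.
        return
    if dist.is_initialized():
        # Creating a second LLM object in the same process should not
        # attempt to initialize the group a second time.
        return
    dist.init_process_group(
        backend="nccl",
        init_method=init_method,
        world_size=world_size,
        rank=rank,
    )
```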
I failed to build the system with the latest NVIDIA PyTorch Docker image. The reason is that the PyTorch installed by `pip` is built with CUDA 11.7, while the container uses CUDA...
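A small check to confirm this kind of mismatch inside the container is to compare the CUDA version PyTorch was compiled against with what the toolkit's `nvcc` reports:

```python
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```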
We need tests for the models we support. The tests should ensure that the outputs of our models when using greedy sampling are equivalent to those of HF models.
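A minimal sketch of such a test, assuming an `LLM` class with a `generate()` method that returns the generated text; the exact engine API, model list, and prompts here are placeholders and may differ from the real code:

```python
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = ["facebook/opt-125m"]  # illustrative; extend to all supported models
PROMPTS = ["Hello, my name is", "The capital of France is"]
MAX_NEW_TOKENS = 32

@pytest.mark.parametrize("model_name", MODELS)
def test_greedy_matches_hf(model_name):
    # Reference: greedy decoding with Hugging Face Transformers.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    hf_model = AutoModelForCausalLM.from_pretrained(model_name)
    hf_outputs = []
    for prompt in PROMPTS:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        output_ids = hf_model.generate(
            input_ids, max_new_tokens=MAX_NEW_TOKENS, do_sample=False
        )
        hf_outputs.append(
            tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
        )

    # System under test: greedy sampling (temperature=0) in the engine.
    from vllm import LLM, SamplingParams  # assumed import path
    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0.0, max_tokens=MAX_NEW_TOKENS)
    engine_outputs = [out.outputs[0].text for out in llm.generate(PROMPTS, params)]

    assert engine_outputs == hf_outputs
```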