vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Hi! Thank you for your amazing framework! I have tried serving a GPT BigCode model using vLLM together with Ray, following the example: https://github.com/ray-project/ray/blob/3d3183d944424a960a2c6ce048abd1316c901c1e/doc/source/serve/doc_code/vllm_example.py And in my use case the...
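For context, a condensed sketch of how a vLLM engine can be wrapped in a Ray Serve deployment, in the spirit of the linked example. The deployment class, model name, and request fields here are assumptions for illustration, not taken from the original report or the Ray example verbatim.

```python
from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams

@serve.deployment
class VLLMDeployment:
    def __init__(self, model: str):
        # Build the vLLM engine once at deployment start-up.
        self.llm = LLM(model=model)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 128))
        # Note: LLM.generate is blocking; a production setup would use the async engine.
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}

# Model name is an assumption; any GPT BigCode checkpoint supported by vLLM would do.
app = VLLMDeployment.bind(model="bigcode/gpt_bigcode-santacoder")
# serve.run(app)  # then POST {"prompt": "..."} to the Serve HTTP endpoint
```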
I noticed that the sampler stage launches many repeated CUDA kernels. It seems sampling is done in a for loop, with one kernel launch per sequence? Why is this? BTW,...
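To illustrate the pattern being asked about (not vLLM's actual sampler code), here is a minimal sketch contrasting per-sequence sampling in a Python loop, which launches a kernel per row, with a single batched launch:

```python
import torch

def sample_per_sequence(logits: torch.Tensor) -> list:
    # One softmax + multinomial launch per sequence (the looped pattern described above).
    token_ids = []
    for i in range(logits.shape[0]):
        probs = torch.softmax(logits[i], dim=-1)
        token_ids.append(torch.multinomial(probs, num_samples=1).item())
    return token_ids

def sample_batched(logits: torch.Tensor) -> list:
    # A single launch covering all sequences at once.
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1).tolist()
```

Per-sequence looping is sometimes needed when each sequence has different sampling parameters; batching is only straightforward when the parameters are uniform.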
chatglm-6b (chatglm2-6b) is a very popular Chinese LLM. Do you have a plan to support it?
Excellent job, it made my LLM blazing fast. I tried it on a T4 (16 GB vRAM) and it seems to lower inference time from 36 seconds to just 9 seconds. I...
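A minimal sketch of the kind of offline-generation usage being timed above; the model name and sampling settings are assumptions, not taken from the original report.

```python
from vllm import LLM, SamplingParams

# Assumed model; any Hugging Face causal LM supported by vLLM works here.
llm = LLM(model="facebook/opt-6.7b")
params = SamplingParams(temperature=0.8, max_tokens=256)

outputs = llm.generate(["Explain paged attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```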
It would be perfect to have a wrapper function that turns a model into a vLLM-enhanced model (like PEFT). It would be useful if we have a LoRA model; we can...
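A purely hypothetical sketch of the requested wrapper (no such API exists in vLLM; the function and argument names below are made up): take a base model path plus an optional LoRA path and hand back a vLLM engine.

```python
from typing import Optional
from vllm import LLM

def wrap_with_vllm(base_model: str, lora_path: Optional[str] = None, **engine_kwargs) -> LLM:
    """Hypothetical PEFT-style wrapper around vLLM's LLM engine."""
    if lora_path is not None:
        # A real implementation would merge the LoRA weights into the base
        # checkpoint first (e.g. via peft's merge_and_unload) and save the
        # merged model to disk before loading it with vLLM.
        raise NotImplementedError("LoRA merging is left out of this sketch")
    return LLM(model=base_model, **engine_kwargs)
```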
I have a question about the efficient memory-sharing feature. Do different sequences that share the same system prompt but append different user-input texts share the computation and memory...
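For concreteness, a small sketch of the scenario being asked about: several requests built from one shared system prompt plus different user inputs. Whether vLLM reuses the KV cache for the shared prefix is exactly the question; this only shows how such prompts would be submitted. The model name and prompt texts are assumptions.

```python
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = "You are a helpful assistant.\n"
user_inputs = [
    "Summarize this article in two sentences.",
    "Translate this sentence into French.",
    "Write a haiku about GPUs.",
]

llm = LLM(model="facebook/opt-1.3b")  # assumed model
prompts = [SYSTEM_PROMPT + text for text in user_inputs]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```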
It would be great if you could support chatglm-6b,It's a popular chinese model。 https://huggingface.co/THUDM/chatglm-6b
Partially fixes #57. Adds a formatter and linter. TODO: add the formatter to CI.
In the file scheduler.py, I find this: `num_batched_tokens = sum(seq_group.num_seqs(status=SequenceStatus.RUNNING) for seq_group in self.running)` and this: `# If the number of batched tokens exceeds the...
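A simplified, self-contained sketch of the logic quoted above (not the actual vLLM scheduler): count one token per RUNNING sequence across the running sequence groups and stop admitting work once a batched-token budget is exceeded. The surrounding classes are stand-ins; only the counting expression mirrors the snippet.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()

@dataclass
class SeqGroup:
    statuses: list  # one status per sequence in the group

    def num_seqs(self, status=None) -> int:
        if status is None:
            return len(self.statuses)
        return sum(1 for s in self.statuses if s == status)

@dataclass
class TinyScheduler:
    running: list = field(default_factory=list)
    max_num_batched_tokens: int = 2560

    def can_schedule_more(self) -> bool:
        # During decoding, each RUNNING sequence contributes one token to the batch.
        num_batched_tokens = sum(
            g.num_seqs(status=SequenceStatus.RUNNING) for g in self.running
        )
        return num_batched_tokens < self.max_num_batched_tokens
```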
Hi, I'm trying to run vllm on a 4-GPU Linux machine. When I followed the Installation guide to `pip install vllm`, I got this error: ``` torch.cuda.DeferredCudaCallError: CUDA call failed...
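Not a fix, just a quick diagnostic sketch for errors like the one above: before installing or importing vllm, confirm that the installed PyTorch build can actually see the GPUs and report which CUDA version it was built against.

```python
import torch

print("torch:", torch.__version__, "built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available(),
      "device count:", torch.cuda.device_count())
```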