vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Results: 2816 vllm issues

Hi! Thank you for your amazing framework! I have tried serving a GPT BigCode model using vllm together with ray, following this example: https://github.com/ray-project/ray/blob/3d3183d944424a960a2c6ce048abd1316c901c1e/doc/source/serve/doc_code/vllm_example.py In my use case, the...

good first issue
feature request
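
For context on the issue above, a minimal offline-inference sketch with vLLM for a GPT BigCode checkpoint might look like the following. This is not the Ray Serve deployment from the linked example; the model name and sampling settings are assumptions for illustration only.

```python
# Minimal vLLM sketch, assuming the "bigcode/starcoder" checkpoint is available.
from vllm import LLM, SamplingParams

llm = LLM(model="bigcode/starcoder")  # load a GPT BigCode model
sampling_params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["def fibonacci(n):"], sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```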

I noticed that the sampler stage launches lots of repeated CUDA kernels. It seems you do sampling in a for loop, launching one kernel per sequence? Why is this? BTW,...
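
As an aside on the kernel-launch concern in that issue, here is a toy PyTorch comparison of per-sequence versus batched sampling. It is an illustration of the general pattern, not vLLM's actual sampler code.

```python
# Illustration only: a Python loop launches one multinomial kernel per row,
# while a single batched call launches one kernel for all rows.
import torch

probs = torch.softmax(torch.randn(32, 50_000, device="cuda"), dim=-1)  # 32 sequences

# Per-sequence loop: 32 separate torch.multinomial kernel launches.
tokens_loop = [torch.multinomial(probs[i], num_samples=1) for i in range(probs.size(0))]

# Batched: one torch.multinomial launch over all rows at once.
tokens_batched = torch.multinomial(probs, num_samples=1)
```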

chatglm-6b (chatglm2-6b) is a very popular Chinese LLM. Do you have a plan to support it?

new model

Excellent job, it made my LLM blazing fast. I tried it on a T4 (16 GB VRAM) and it seems to lower inference time from 36 seconds to just 9 seconds. I...

It would be perfect to have a wrapper function that turns a model into a vllm-enhanced model (like PEFT). This would be useful if we have a LoRA model; we can...

I have a question about the efficient memory sharing feature. Do different sequences that share the same system prompt but splice in different user-input texts share the computation and memory...

feature request
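
To make the scenario in that question concrete, here is a hedged sketch of several requests that prepend the same system prompt before different user inputs. The model name and prompts are placeholders, and the sketch only shows how such requests would be issued; whether their prefill computation and KV cache are actually shared is exactly what the issue asks.

```python
# Several prompts sharing one system prefix, submitted in a single generate() call.
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = "You are a helpful assistant.\n"
user_inputs = ["Summarize this article: ...", "Translate to French: ..."]

llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration
prompts = [SYSTEM_PROMPT + text for text in user_inputs]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```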

It would be great if you could support chatglm-6b; it's a popular Chinese model. https://huggingface.co/THUDM/chatglm-6b

new model

Partially fixes #57. Adds a formatter and a linter. TODO: add the formatter to CI.

In the file scheduler.py, I found this: `num_batched_tokens = sum(seq_group.num_seqs(status=SequenceStatus.RUNNING) for seq_group in self.running)` and this: `# If the number of batched tokens exceeds the...`

Hi, I'm trying to run vllm on a 4-GPU Linux machine. When I followed the Installation guide to `pip install vllm`, I got this error: `torch.cuda.DeferredCudaCallError: CUDA call failed...`

Installation