vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Hi! Thank you for your amazing framework! I have tried serving a GPT BigCode model using vLLM together with Ray, following the example: https://github.com/ray-project/ray/blob/3d3183d944424a960a2c6ce048abd1316c901c1e/doc/source/serve/doc_code/vllm_example.py And in my use case the...
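For context, a condensed sketch of how a vLLM engine can be wrapped in a Ray Serve deployment, in the spirit of the linked example. The deployment class, model name, and request fields here are assumptions for illustration, not taken from the original report or the Ray example verbatim.

```python
from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams

@serve.deployment
class VLLMDeployment:
    def __init__(self, model: str):
        # Build the vLLM engine once at deployment start-up.
        self.llm = LLM(model=model)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 128))
        # Note: LLM.generate is blocking; a production setup would use the async engine.
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}

# Model name is an assumption; any GPT BigCode checkpoint supported by vLLM would do.
app = VLLMDeployment.bind(model="bigcode/gpt_bigcode-santacoder")
# serve.run(app)  # then POST {"prompt": "..."} to the Serve HTTP endpoint
```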
I noticed that the sampler stage launches many repeated CUDA kernels. It seems sampling is done in a for loop, with one kernel launch per sequence? Why is this? BTW,...
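To illustrate the pattern being asked about (not vLLM's actual sampler code), here is a minimal sketch contrasting per-sequence sampling in a Python loop, which launches a kernel per row, with a single batched launch:

```python
import torch

def sample_per_sequence(logits: torch.Tensor) -> list:
    # One softmax + multinomial launch per sequence (the looped pattern described above).
    token_ids = []
    for i in range(logits.shape[0]):
        probs = torch.softmax(logits[i], dim=-1)
        token_ids.append(torch.multinomial(probs, num_samples=1).item())
    return token_ids

def sample_batched(logits: torch.Tensor) -> list:
    # A single launch covering all sequences at once.
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1).tolist()
```

Per-sequence looping is sometimes needed when each sequence has different sampling parameters; batching is only straightforward when the parameters are uniform.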
chatglm-6b (chatglm2-6b) is a very popular Chinese LLM. Do you have a plan to support it?
Excellent job, it made my LLM blazing fast. I tried it on a T4 (16 GB vRAM) and it seems to lower inference time from 36 seconds to just 9 seconds. I...
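A minimal sketch of the kind of offline-generation usage being timed above; the model name and sampling settings are assumptions, not taken from the original report.

```python
from vllm import LLM, SamplingParams

# Assumed model; any Hugging Face causal LM supported by vLLM works here.
llm = LLM(model="facebook/opt-6.7b")
params = SamplingParams(temperature=0.8, max_tokens=256)

outputs = llm.generate(["Explain paged attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```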
It would be perfect to have a wrapper function that turns a model into a vLLM-enhanced model (like PEFT). It would be useful if we have a LoRA model; we can...
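A purely hypothetical sketch of the requested wrapper (no such API exists in vLLM; the function and argument names below are made up): take a base model path plus an optional LoRA path and hand back a vLLM engine.

```python
from typing import Optional
from vllm import LLM

def wrap_with_vllm(base_model: str, lora_path: Optional[str] = None, **engine_kwargs) -> LLM:
    """Hypothetical PEFT-style wrapper around vLLM's LLM engine."""
    if lora_path is not None:
        # A real implementation would merge the LoRA weights into the base
        # checkpoint first (e.g. via peft's merge_and_unload) and save the
        # merged model to disk before loading it with vLLM.
        raise NotImplementedError("LoRA merging is left out of this sketch")
    return LLM(model=base_model, **engine_kwargs)
```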
I have a question about the efficient memory-sharing feature. Do different sequences that share the same system prompt but append different user-input texts share the computation and memory...
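For concreteness, a small sketch of the scenario being asked about: several requests built from one shared system prompt plus different user inputs. Whether vLLM reuses the KV cache for the shared prefix is exactly the question; this only shows how such prompts would be submitted. The model name and prompt texts are assumptions.

```python
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = "You are a helpful assistant.\n"
user_inputs = [
    "Summarize this article in two sentences.",
    "Translate this sentence into French.",
    "Write a haiku about GPUs.",
]

llm = LLM(model="facebook/opt-1.3b")  # assumed model
prompts = [SYSTEM_PROMPT + text for text in user_inputs]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```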
It would be great if you could support chatglm-6b,It's a popular chinese model。 https://huggingface.co/THUDM/chatglm-6b
Partially fixes #57. Adds a formatter and linter. TODO: add the formatter to CI.
In the file scheduler.py, I find this: `num_batched_tokens = sum(seq_group.num_seqs(status=SequenceStatus.RUNNING) for seq_group in self.running)` and this: `# If the number of batched tokens exceeds the...
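A simplified, self-contained sketch of the logic quoted above (not the actual vLLM scheduler): count one token per RUNNING sequence across the running sequence groups and stop admitting work once a batched-token budget is exceeded. The surrounding classes are stand-ins; only the counting expression mirrors the snippet.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()

@dataclass
class SeqGroup:
    statuses: list  # one status per sequence in the group

    def num_seqs(self, status=None) -> int:
        if status is None:
            return len(self.statuses)
        return sum(1 for s in self.statuses if s == status)

@dataclass
class TinyScheduler:
    running: list = field(default_factory=list)
    max_num_batched_tokens: int = 2560

    def can_schedule_more(self) -> bool:
        # During decoding, each RUNNING sequence contributes one token to the batch.
        num_batched_tokens = sum(
            g.num_seqs(status=SequenceStatus.RUNNING) for g in self.running
        )
        return num_batched_tokens < self.max_num_batched_tokens
```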
Hi, I'm trying to run vllm on a 4-GPU Linux machine. When I followed the Installation guide to `pip install vllm`, I got this error: ``` torch.cuda.DeferredCudaCallError: CUDA call failed...
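Not a fix, just a quick diagnostic sketch for errors like the one above: before installing or importing vllm, confirm that the installed PyTorch build can actually see the GPUs and report which CUDA version it was built against.

```python
import torch

print("torch:", torch.__version__, "built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available(),
      "device count:", torch.cuda.device_count())
```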