
A high-throughput and memory-efficient inference and serving engine for LLMs

Results 2816 vllm issues
Sort by recently updated

Thanks for fixing #254. After I updated the code to the latest version, when I executed the following command: ``` python -m vllm.entrypoints.openai.api_server --model /home/foo/workshop/text-generation-webui/models/WizardLM_WizardCoder-15B-V1.0/ ``` the following error...
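For context, a minimal sanity check against the OpenAI-compatible endpoint looks roughly like the sketch below. It assumes the server came up on the default port 8000 and that the `model` field matches the `--model` path passed at startup; the prompt is just an illustrative placeholder.

```python
# Minimal sketch: query the OpenAI-compatible server started above.
# Assumes the default host/port (localhost:8000) and that the "model" field
# matches the --model path given on the command line.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/home/foo/workshop/text-generation-webui/models/WizardLM_WizardCoder-15B-V1.0/",
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
        "temperature": 0,
    },
)
print(resp.json())
```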

Is it normal to have higher latency than TGI with a low concurrency, such as 1 or 4?

Is ParallelConfig.pipeline_parallel_size used for multiple GPU cards? Can it be set to the number of GPU cards? Does it relate to processing multiple prompts and generating multiple results in parallel?...
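For reference, the multi-GPU knob actually exposed on the `LLM` entrypoint is `tensor_parallel_size`, and multiple prompts are handled simply by passing a list to `generate()`; a minimal sketch, assuming two visible GPUs and a small placeholder model:

```python
# Minimal sketch, assuming two visible GPUs. tensor_parallel_size shards the
# model across the cards; batching multiple prompts just means passing a list.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```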

It would be great if you could add support for LongChat (from FastChat) models, which have a 16k context length: https://github.com/lm-sys/FastChat https://github.com/lm-sys/FastChat/blob/6d06351542bc0c3701d54619e6df4c26aa91a260/fastchat/model/llama_condense_monkey_patch.py#L10C18-L10C18
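The linked monkey patch "condenses" rotary position indices by a fixed ratio so a 16k-context fine-tune reuses the original trained position range; a rough, illustrative sketch of that idea (the function name and ratio below are hypothetical, not vLLM internals):

```python
# Rough sketch of the "condense" idea from the linked FastChat patch:
# compress positions 0..seq_len-1 by a fixed ratio before computing the
# rotary embeddings. Names and the ratio value are illustrative only.
import torch

def condensed_positions(seq_len: int, ratio: float = 8.0) -> torch.Tensor:
    return torch.arange(seq_len, dtype=torch.float32) / ratio

print(condensed_positions(16, ratio=8.0))  # 0.000, 0.125, 0.250, ...
```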

new model

Hi everyone. I'm trying to use the newly added MPT-7B support in vllm. I'm running on SageMaker Studio, on a g4dn.2xlarge instance; however, I'm getting the following error: `RuntimeError: probability...

We can use the get_conversation_template method, so we don't need to specify chat_template. When new models are added, we don't need to change our code, because the get_conversation_template method will first find...
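A hedged sketch of what the suggestion refers to, assuming FastChat is installed: get_conversation_template resolves the right conversation format from the model name and builds the prompt string, so no per-model chat_template would be needed.

```python
# Sketch of FastChat's template lookup (assumes `pip install fschat`).
from fastchat.model import get_conversation_template

conv = get_conversation_template("vicuna")             # resolved from the model name
conv.append_message(conv.roles[0], "Tell me a joke.")  # user turn
conv.append_message(conv.roles[1], None)               # leave the reply slot empty
print(conv.get_prompt())                               # final prompt string
```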

While playing with it, I've stumbled upon strange behavior that might indicate there is some issue when beam search is used. I've started the server with: `python3 -m...
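One way to narrow this down is to exercise beam search offline, separately from the server; a minimal sketch with a small placeholder model (in the versions I've used, beam search expects temperature 0 and best_of greater than 1):

```python
# Minimal offline beam-search sketch; the model is a small placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(n=2, best_of=4, use_beam_search=True,
                        temperature=0.0, max_tokens=32)
for out in llm.generate(["The quick brown fox"], params):
    for candidate in out.outputs:
        print(candidate.text)
```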

I use the OpenBuddy LLaMA 7B model for generation. The first generation of 269 tokens took 5.9 seconds, and the second generation of 550 tokens took 12.3 seconds, but there was...
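For what it's worth, both runs work out to roughly the same per-token rate: 269 / 5.9 ≈ 46 tokens/s versus 550 / 12.3 ≈ 45 tokens/s, so the wall-clock time scales with the number of generated tokens.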

I tried the code `llm = LLM(model="facebook/opt-125m")` on a single T4 and found the memory cost exceeded 11 GB, while the Hugging Face code `model = AutoModel.from_pretrained("facebook/opt-125m").cuda()` only cost 1 GB of memory. How...
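The gap is expected: vLLM preallocates most of the free GPU memory for its KV cache, whereas the Hugging Face call only loads the weights. The fraction is controlled by gpu_memory_utilization (0.90 by default in the versions I've seen); a minimal sketch:

```python
# Minimal sketch: cap the preallocated KV-cache pool at ~30% of GPU memory.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.30)
```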

I installed vllm with ```shell pip install vllm ``` then used this command to start the server: ```shell CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.api_server --model llama-7b-hf/ --swap-space 16 --disable-log-requests --port 9009 ``` and benchmarked with...
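For a single smoke-test request against the demo api_server started above, something like the sketch below should work; it assumes the server is listening on port 9009 as in the command, and that extra JSON fields are forwarded as sampling parameters to /generate.

```python
# Hedged sketch: one request against the demo api_server's /generate route.
import requests

resp = requests.post(
    "http://localhost:9009/generate",
    json={"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0.8},
)
print(resp.json())
```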