
A high-throughput and memory-efficient inference and serving engine for LLMs

Results 2816 vllm issues
Sort by recently updated

Thanks for fixing #254. After I updated the code to the latest version, when I executed the following command: ``` python -m vllm.entrypoints.openai.api_server --model /home/foo/workshop/text-generation-webui/models/WizardLM_WizardCoder-15B-V1.0/ ``` the following error...
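For context, a minimal sanity check against the OpenAI-compatible endpoint looks roughly like the sketch below. It assumes the server came up on the default port 8000 and that the `model` field matches the `--model` path passed at startup; the prompt is just an illustrative placeholder.

```python
# Minimal sketch: query the OpenAI-compatible server started above.
# Assumes the default host/port (localhost:8000) and that the "model" field
# matches the --model path given on the command line.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/home/foo/workshop/text-generation-webui/models/WizardLM_WizardCoder-15B-V1.0/",
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
        "temperature": 0,
    },
)
print(resp.json())
```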

Is it normal to have higher latency than TGI with a low concurrency, such as 1 or 4?

Is ParallelConfig.pipeline_parallel_size used for multiple GPU cards? Can it be set to the number of GPU cards? Does it relate to processing multiple prompts and generating multiple results in parallel?...
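For reference, the multi-GPU knob actually exposed on the `LLM` entrypoint is `tensor_parallel_size`, and multiple prompts are handled simply by passing a list to `generate()`; a minimal sketch, assuming two visible GPUs and a small placeholder model:

```python
# Minimal sketch, assuming two visible GPUs. tensor_parallel_size shards the
# model across the cards; batching multiple prompts just means passing a list.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```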

It would be great if you could add support for LongChat (from FastChat) models, which have a 16k context length: https://github.com/lm-sys/FastChat https://github.com/lm-sys/FastChat/blob/6d06351542bc0c3701d54619e6df4c26aa91a260/fastchat/model/llama_condense_monkey_patch.py#L10C18-L10C18
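The linked monkey patch "condenses" rotary position indices by a fixed ratio so a 16k-context fine-tune reuses the original trained position range; a rough, illustrative sketch of that idea (the function name and ratio below are hypothetical, not vLLM internals):

```python
# Rough sketch of the "condense" idea from the linked FastChat patch:
# compress positions 0..seq_len-1 by a fixed ratio before computing the
# rotary embeddings. Names and the ratio value are illustrative only.
import torch

def condensed_positions(seq_len: int, ratio: float = 8.0) -> torch.Tensor:
    return torch.arange(seq_len, dtype=torch.float32) / ratio

print(condensed_positions(16, ratio=8.0))  # 0.000, 0.125, 0.250, ...
```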

new model

Hi everyone. I'm trying to use the newly added MPT-7B support in vllm. I'm running on SageMaker Studio, on a g4dn.2xlarge instance; however, I'm getting the following error: `RuntimeError: probability...

We can use the get_conversation_template method, so we don't need to specify chat_template. When new models are added, we don't need to change our code, because the get_conversation_template method will first find...
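A hedged sketch of what the suggestion refers to, assuming FastChat is installed: get_conversation_template resolves the right conversation format from the model name and builds the prompt string, so no per-model chat_template would be needed.

```python
# Sketch of FastChat's template lookup (assumes `pip install fschat`).
from fastchat.model import get_conversation_template

conv = get_conversation_template("vicuna")             # resolved from the model name
conv.append_message(conv.roles[0], "Tell me a joke.")  # user turn
conv.append_message(conv.roles[1], None)               # leave the reply slot empty
print(conv.get_prompt())                               # final prompt string
```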

While playing with it, I've stumbled upon strange behavior that might indicate there is some issue when beam search is used. I've started the server with: `python3 -m...
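One way to narrow this down is to exercise beam search offline, separately from the server; a minimal sketch with a small placeholder model (in the versions I've used, beam search expects temperature 0 and best_of greater than 1):

```python
# Minimal offline beam-search sketch; the model is a small placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(n=2, best_of=4, use_beam_search=True,
                        temperature=0.0, max_tokens=32)
for out in llm.generate(["The quick brown fox"], params):
    for candidate in out.outputs:
        print(candidate.text)
```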

I use the OpenBuddy LLaMA 7B model for generation. The first generation of 269 tokens took 5.9 seconds, and the second generation of 550 tokens took 12.3 seconds, but there was...
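For what it's worth, both runs work out to roughly the same per-token rate: 269 / 5.9 ≈ 46 tokens/s versus 550 / 12.3 ≈ 45 tokens/s, so the wall-clock time scales with the number of generated tokens.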

I tried the code `llm = LLM(model="facebook/opt-125m")` on a single T4 and found the memory cost exceeded 11 GB, while the Hugging Face code `model = AutoModel.from_pretrained("facebook/opt-125m").cuda()` only cost 1 GB of memory. How...
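The gap is expected: vLLM preallocates most of the free GPU memory for its KV cache, whereas the Hugging Face call only loads the weights. The fraction is controlled by gpu_memory_utilization (0.90 by default in the versions I've seen); a minimal sketch:

```python
# Minimal sketch: cap the preallocated KV-cache pool at ~30% of GPU memory.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.30)
```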

I installed vllm with ```shell pip install vllm ``` then used this command to start the server: ```shell CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.api_server --model llama-7b-hf/ --swap-space 16 --disable-log-requests --port 9009 ``` and benchmarked with...
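For a single smoke-test request against the demo api_server started above, something like the sketch below should work; it assumes the server is listening on port 9009 as in the command, and that extra JSON fields are forwarded as sampling parameters to /generate.

```python
# Hedged sketch: one request against the demo api_server's /generate route.
import requests

resp = requests.post(
    "http://localhost:9009/generate",
    json={"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0.8},
)
print(resp.json())
```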