
Waiting sequence group should have only one prompt sequence.

Link-Li opened this issue 2 years ago • 6 comments

I encountered the following error while using vLLM to run Baichuan; it always appears after the model has been running for a while:

"Waiting sequence group should have only one prompt sequence."

Could you please tell me why this happens?

I use a V100 32GB GPU and set the batch size to 4, like this:

prompt_ids = [[195, ..., 196], [195, ..., 196], [195, ..., 196], [195, ..., 196]]
sampling_params = SamplingParams(n=3, temperature=0.3, top_p=0.85, top_k=5, max_tokens=2048, presence_penalty=1.1)
output_list = llm.generate(None, sampling_params, prompt_ids, use_tqdm=False)
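
For reference, a self-contained version of this setup looks roughly like the following (the model path and the token ids are placeholders, not my exact values):

from vllm import LLM, SamplingParams

# Placeholder Baichuan checkpoint; Baichuan models need trust_remote_code=True.
llm = LLM(model="baichuan-inc/Baichuan-13B-Chat", trust_remote_code=True)

# Batch of 4 tokenized prompts (placeholder token ids).
prompt_ids = [[195, 1234, 5678, 196]] * 4
sampling_params = SamplingParams(n=3, temperature=0.3, top_p=0.85, top_k=5,
                                 max_tokens=2048, presence_penalty=1.1)

# Older vLLM versions take the prompt token ids positionally after sampling_params.
output_list = llm.generate(None, sampling_params, prompt_ids, use_tqdm=False)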

Thank you!

Link-Li avatar Sep 18 '23 13:09 Link-Li

I observe the same error with CodeLlama-7b as well.

awasthiabhijeet avatar Oct 10 '23 06:10 awasthiabhijeet

CC: @WoosukKwon @zhuohan123

awasthiabhijeet avatar Oct 10 '23 06:10 awasthiabhijeet

I ran into the same problem. It occurs when n (the number of returned sequences) is set greater than 1, and more often when GPU memory is tight. A simple workaround is to duplicate the prompt and set n to 1 (see the sketch below), but that costs some speed. There should be a better way.
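
A sketch of that workaround, assuming the offline LLM API (the model name and prompt here are placeholders):

from vllm import LLM, SamplingParams

llm = LLM(model="your-model")  # placeholder model name
prompt = "Write a haiku about GPUs."  # placeholder prompt
n = 3  # number of completions wanted

# Instead of SamplingParams(n=3, ...), submit n copies of the prompt with n=1.
sampling_params = SamplingParams(n=1, temperature=0.3, top_p=0.85)
outputs = llm.generate([prompt] * n, sampling_params)
completions = [out.outputs[0].text for out in outputs]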

TechxGenus avatar Nov 20 '23 14:11 TechxGenus

I'm running into this as well—it seems to be more prevalent with larger models and also shows up when using best_of.

kevinhu avatar Nov 28 '23 23:11 kevinhu

After some digging, the bug seems to be related to calling _preempt_by_recompute from _preempt, which inserts sequence groups at the front of the waiting queue. (But based on the TODO there, vLLM doesn't support recomputation for groups with multiple sequences?)

A quick fix is to force _preempt to only use PreemptionMode.SWAP, which fixes the error for me—but that's probably not ideal.
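
For reference, the change amounts to something like this in vllm/core/scheduler.py (a rough sketch; the method names follow the version I looked at, and exact signatures may differ in yours):

def _preempt(self, seq_group, blocks_to_swap_out, preemption_mode=None):
    # Upstream chooses between RECOMPUTE and SWAP; recompute pushes the group
    # back onto the waiting queue, which trips the single-prompt-sequence check
    # for groups that already have multiple sequences.
    preemption_mode = PreemptionMode.SWAP  # always swap, never recompute
    if preemption_mode == PreemptionMode.RECOMPUTE:
        self._preempt_by_recompute(seq_group)
    else:
        self._preempt_by_swap(seq_group, blocks_to_swap_out)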

kevinhu avatar Nov 29 '23 01:11 kevinhu

I encountered this with Mistral 7B on an A10 using AsyncLLMEngine once the number of pending requests rose above zero. Removing n and best_of from the SamplingParams is a workaround.

SamHjelmfelt avatar Jan 05 '24 01:01 SamHjelmfelt

I faced this issue with CodeLlama 13B (bfloat16) on an A100 80GB GPU with 64GB of CPU swap when using n=1 and best_of=16 with a generation length of 512. Ultimately I also had to set best_of to 1.

iNeil77 avatar Jan 24 '24 20:01 iNeil77

I experience this too when running Llama 3 8B with SamplingParams(n=1, ...) and calling the model in parallel. In general, I think it relates to this issue, so it is probably something with the default_scheduler.

Would love to get some help here.

tsvisab avatar May 04 '24 14:05 tsvisab