Question about the sampler: it takes too much time
I noticed that the sampler stage uses lots of repeated CUDA kernels. It seems you do sampling in a for loop, launching a separate kernel for each sequence? Why is this? BTW, have you compared the performance with FasterTransformer? I didn't see anything about this. Thank you!
Below is my code:

import time
import nvtx
from vllm import LLM, SamplingParams

# input_ids: a list of prompt token-id lists, loaded separately
path = '/data/llm/hf-llama-7b/'
llm = LLM(model=path)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
sampling_params.max_tokens = 1  # generate only one token so the loop times the prefill step

cnt = 1
start = time.time()
for i in range(cnt):
    with nvtx.annotate("generate", color="red"):
        outputs = llm.generate(prompt_token_ids=input_ids, sampling_params=sampling_params)
end = time.time()
prefill_ticks = (end - start) / cnt
@sleepwalker2017 Thanks for trying out vLLM and reporting the performance issue! Yes, our sampler is indeed not optimized well yet. In particular, vLLM performs sampling for one request at a time, because each request can have different sampling parameters. For example, request A may use top-p sampling while request B in the same batch may use beam search with beam width 6. In such a case, it's not possible to process the sampling operations for the two requests simultaneously. Instead, vLLM processes one request at a time. This can incur non-negligible latency overhead, especially when you run small models.
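As an illustration (this is not vLLM's actual sampler code, just a minimal sketch), here is the difference between per-request and batched sampling, assuming a logits tensor of shape [batch_size, vocab_size] and hypothetical per-request params objects that carry a temperature field:

import torch

def sample_per_request(logits, per_request_params):
    # One softmax/multinomial launch per request: each request may use a
    # different strategy (top-p, beam search, ...), so requests are handled
    # one at a time. With batch = 128 this means ~128x the kernel launches.
    next_tokens = []
    for i, params in enumerate(per_request_params):
        probs = torch.softmax(logits[i] / params.temperature, dim=-1)
        next_tokens.append(torch.multinomial(probs, num_samples=1))
    return torch.cat(next_tokens)

def sample_batched(logits, temperature):
    # If every request shared the same sampling params, the whole batch
    # could be sampled with a single pair of kernel launches instead.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)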
That being said, your profiling result is very weird. Could you provide more information about the input_ids you used (e.g., number of sequences, sequence length)?
Please refer to #264 for the comparison with FasterTransformer.
Of course, I can provide the input_ids.
Actually, it's nothing special. I use batch = 128, seq_len = 32. I've uploaded my test inputs: input_ids.txt
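For anyone who wants to reproduce this without the attachment, a placeholder input of the same shape can be built like the sketch below (random token ids instead of the real contents of input_ids.txt; 32000 is assumed as the LLaMA-7B vocab size):

import random

# Placeholder prompt token ids with the same shape as my test case:
# 128 sequences, 32 tokens each. The real values are in input_ids.txt.
batch_size, seq_len, vocab_size = 128, 32, 32000
input_ids = [[random.randrange(vocab_size) for _ in range(seq_len)]
             for _ in range(batch_size)]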
Closing this issue as stale, as there has been no discussion in the past 3 months.
If you are still experiencing the issue you describe, feel free to re-open this issue.