Question about the sampler: it takes too much time
I noticed that the sampler stage uses lots of repeated CUDA kernels. It seems you do sampling in a for loop, launching a separate kernel for each sequence? Why is this? BTW, have you compared the performance with FasterTransformer? I didn't see anything about this. Thank you!
Below is my code:

import time
import nvtx
from vllm import LLM, SamplingParams

# input_ids: a list of prompt token-id lists, loaded separately
path = '/data/llm/hf-llama-7b/'
llm = LLM(model=path)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
sampling_params.max_tokens = 1  # generate only one token so the loop times the prefill step

cnt = 1
start = time.time()
for i in range(cnt):
    with nvtx.annotate("generate", color="red"):
        outputs = llm.generate(prompt_token_ids=input_ids, sampling_params=sampling_params)
end = time.time()
prefill_ticks = (end - start) / cnt
@sleepwalker2017 Thanks for trying out vLLM and reporting the performance issue! Yes, our sampler is indeed not optimized well yet. In particular, vLLM performs sampling for one request at a time, because each request can have different sampling parameters. For example, request A may use top-p sampling while request B in the same batch may use beam search with beam width 6. In such a case, it's not possible to process the sampling operations for the two requests simultaneously. Instead, vLLM processes one request at a time. This can incur non-negligible latency overhead, especially when you run small models.
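As an illustration (this is not vLLM's actual sampler code, just a minimal sketch), here is the difference between per-request and batched sampling, assuming a logits tensor of shape [batch_size, vocab_size] and hypothetical per-request params objects that carry a temperature field:

import torch

def sample_per_request(logits, per_request_params):
    # One softmax/multinomial launch per request: each request may use a
    # different strategy (top-p, beam search, ...), so requests are handled
    # one at a time. With batch = 128 this means ~128x the kernel launches.
    next_tokens = []
    for i, params in enumerate(per_request_params):
        probs = torch.softmax(logits[i] / params.temperature, dim=-1)
        next_tokens.append(torch.multinomial(probs, num_samples=1))
    return torch.cat(next_tokens)

def sample_batched(logits, temperature):
    # If every request shared the same sampling params, the whole batch
    # could be sampled with a single pair of kernel launches instead.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)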
That being said, your profiling result is very weird. Could you provide more information about the input_ids you used (e.g., number of sequences, sequence length)?
Please refer to #264 for the comparison with FasterTransformer.
Of course, I can provide the input_ids.
Actually, it's nothing special. I use batch = 128, seq_len = 32. I've uploaded my test inputs: input_ids.txt
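For anyone who wants to reproduce this without the attachment, a placeholder input of the same shape can be built like the sketch below (random token ids instead of the real contents of input_ids.txt; 32000 is assumed as the LLaMA-7B vocab size):

import random

# Placeholder prompt token ids with the same shape as my test case:
# 128 sequences, 32 tokens each. The real values are in input_ids.txt.
batch_size, seq_len, vocab_size = 128, 32, 32000
input_ids = [[random.randrange(vocab_size) for _ in range(seq_len)]
             for _ in range(batch_size)]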
Closing this issue as stale, as there has been no discussion in the past 3 months.
If you are still experiencing the issue you describe, feel free to re-open this issue.