vLLM is 4x faster than HF for offline inference
Thanks for the great project.
I gave it a try and compared it with HF's offline inference speed on 100 Alpaca examples. The hardware I used is a single V100-40G GPU. Here is my script for vLLM:
```python
import time

from vllm import LLM, SamplingParams

# Set ignore_eos=True so every prompt generates the full max_tokens.
sampling_params = SamplingParams(temperature=0.1, top_p=0.75, top_k=40,
                                 max_tokens=128, ignore_eos=True)
llm = LLM(model="openlm-research/open_llama_13b")

# Prepare dataset.
start_time = time.time()
for data in my_dataset:
    llm.generate(data, sampling_params)
end_time = time.time()
```
and for HF:
```python
import time

from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("openlm-research/open_llama_7b").cuda()
tokenizer = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
# Match the vLLM sampling settings above.
generation_config = GenerationConfig(do_sample=True, temperature=0.1,
                                     top_p=0.75, top_k=40)

# Prepare dataset.
start_time = time.time()
for data in my_dataset:
    input_ids = tokenizer(data, return_tensors="pt")["input_ids"].cuda()
    model.generate(input_ids, generation_config=generation_config,
                   max_new_tokens=128)
end_time = time.time()
```
| API | Model Size | Time (minutes) |
|---|---|---|
| HF | 7B | 12.7 |
| vLLM | 7B | 3.1 |
| HF | 13B | 15.8 |
| vLLM | 13B | 5.3 |
It seems that the speedup is ~3-4x (not 25x). Am I missing a special setup for vLLM? Thanks.
Hi @flyman3046, thanks for trying out vLLM! Could you try

```python
llm.generate(my_dataset, sampling_params)
```

instead of the for loop? The LLM class internally maintains a queue of input sequences and automatically batches them, admitting new sequences into the batch whenever one finishes. This is one of the factors that makes vLLM significantly faster than HF. Please try this out!
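For reference, the batched call returns one RequestOutput per prompt; here is a minimal sketch, assuming `my_dataset` is a list of prompt strings:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="openlm-research/open_llama_7b")
sampling_params = SamplingParams(temperature=0.1, top_p=0.75, top_k=40,
                                 max_tokens=128, ignore_eos=True)

# Pass the whole list of prompts in one call; vLLM schedules and batches them internally.
outputs = llm.generate(my_dataset, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)  # first completion for each prompt
```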
I gave it another try with `llm.generate(my_dataset, ...)` and it indeed speeds things up quite a lot, from 180 seconds to 16 seconds.
A follow-up question: how much of the speedup is due to batching, and how much to other improvements? Is it a fair comparison if HF does not use batching whereas vLLM does? Thanks again!
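For reference, a statically batched HF baseline might look roughly like the following sketch (the batch size, padding setup, and dtype are assumptions, not something benchmarked here):

```python
import torch
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained(
    "openlm-research/open_llama_7b", torch_dtype=torch.float16
).cuda()
tokenizer = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

generation_config = GenerationConfig(do_sample=True, temperature=0.1,
                                     top_p=0.75, top_k=40)

batch_size = 8  # assumed; tune to fit GPU memory
for i in range(0, len(my_dataset), batch_size):
    batch = my_dataset[i : i + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    model.generate(**inputs, generation_config=generation_config,
                   max_new_tokens=128)
```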
@flyman3046 Thanks for sharing your experience! We use a more sophisticated batching mechanism than traditional (static) batching. In short, vLLM does not wait until all the sequences in a batch finish; it packs incoming sequences into the batch whenever a sequence finishes. In our experience this leads to a 3x-10x throughput improvement. To implement this on top of HF, you would need to rewrite the model code and develop special CUDA kernels, which is what vLLM does.
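To make the difference concrete, here is a toy scheduling simulation (illustrative numbers only, not vLLM's actual scheduler): with static batching the whole batch waits for its longest sequence, while continuous batching backfills a freed slot as soon as any sequence finishes.

```python
# Toy model: each request needs `length` decode steps and the GPU runs `slots`
# sequences at once. Purely illustrative; this is not vLLM's real scheduler.

def static_batching_steps(lengths, slots):
    # Each batch runs until its longest member finishes; short sequences sit idle.
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    # A finished sequence's slot is refilled immediately from the waiting queue.
    queue, active, steps = list(lengths), [], 0
    while queue and len(active) < slots:
        active.append(queue.pop(0))
    while active:
        step = min(active)                     # run until the next sequence finishes
        steps += step
        active = [r - step for r in active if r > step]
        while queue and len(active) < slots:   # backfill the freed slots
            active.append(queue.pop(0))
    return steps

lengths = [30, 120, 45, 128, 10, 90, 128, 20]       # made-up output lengths
print(static_batching_steps(lengths, slots=4))      # whole batch waits for the longest
print(continuous_batching_steps(lengths, slots=4))  # noticeably fewer total steps
```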
@WoosukKwon Another question, please help me: if I call `print(llm.generate(data, sampling_params))` in each iteration of the loop, I get the answer immediately, so I think this means the generation is not batched across iterations. Is there something wrong with this?