[Performance]: reproducing vLLM performance benchmark
Proposal to improve performance
No response
Report of performance regression
No response
Misc discussion on performance
To reproduce vLLM's performance benchmark, please launch a shell in one of the following Docker images (an example docker command is shown after the list):
- SGLang:
lmsysorg/sglang:v0.3.0-cu124
- lmdeploy:
openmmlab/lmdeploy:v0.6.0a0-cu12
- TensorRT-LLM:
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
- vLLM:
vllm/vllm-openai:v0.6.0
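For reference, here is one way to launch such a shell (a minimal sketch: the GPU flags, shared-memory size, Hugging Face cache mount, and entrypoint override are illustrative defaults, and the image tag should be swapped for whichever engine you are benchmarking):
# example: interactive shell inside the vLLM image with all GPUs visible
# --entrypoint overrides images whose default entrypoint starts a server
docker run -it --rm --gpus all \
    --shm-size=16g \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --entrypoint /bin/bash \
    vllm/vllm-openai:v0.6.0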
Then run the following bash script (don't forget to replace <your HF TOKEN> with your Hugging Face token that has Llama-3 model access):
export HF_TOKEN=<your HF TOKEN>
apt update
apt install -y wget unzip
# download benchmarking code
wget -O benchmarking_code.zip https://buildkite.com/organizations/vllm/pipelines/performance-benchmark/builds/8532/jobs/0191bbbf-c603-4c15-9f5d-e0b2933ba097/artifacts/0191bd2a-d6cd-4f6d-b618-a7aa2c39456c
unzip benchmarking_code.zip
# remove previous results
rm -r ./benchmarks/results
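# run the full nightly benchmark suite (results are written to ./benchmarks/results)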
VLLM_SOURCE_CODE_LOC=$(pwd) bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
Your benchmarking results will be in ./benchmarks/results, with the name format xxx_nightly_results.json; they can be loaded and converted to a pandas DataFrame via pandas.DataFrame.from_dict(). Each benchmark run takes roughly 1 hour 10 minutes, assuming the model weights are already downloaded (and 1 hour 30 minutes for TensorRT-LLM, as it needs to convert the model into a Triton inference engine).
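As a quick sketch of that loading step (run from the same working directory; this assumes nothing about the JSON layout beyond what is stated above, and simply previews each results file):
python3 - <<'EOF'
import glob
import json

import pandas as pd

# load every nightly results file produced by the run and preview it
for path in sorted(glob.glob("./benchmarks/results/*_nightly_results.json")):
    with open(path) as f:
        df = pd.DataFrame.from_dict(json.load(f))
    print(path)
    print(df.head())
EOF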
When you run the H100 benchmark inside the TensorRT-LLM docker container, you may run into a memory leak issue (issue link). In this case, please add the following code
# temporary fix for trt
kill_gpu_processes
bash -c "python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
--world_size=${tp} \
--model_repo=/tensorrtllm_backend/triton_model_repo & " </dev/null >/dev/null 2>&1 &
wait_for_server
at Line 211 (right after the for loop) in ./.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh, to force TensorRT-LLM to restart the server more often.
Known issue:
- Across the different serving engines, the number of output tokens does not strictly align (even after setting `ignore_eos` or `max_length`, due to imperfect implementations of these two flags in the different engines). That said, the number of tokens generated by vLLM is roughly aligned with the other engines, as all engines perform greedy sampling using the same model. A rough way to eyeball this from the results files is sketched below.
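If you want to sanity-check the token counts yourself, the following sketch lists whatever token-related columns each engine's results file exposes (no specific column names are assumed; it just scans for anything with "token" in the name, which may differ between engines and benchmark versions):
python3 - <<'EOF'
import glob
import json

import pandas as pd

# surface token-count-like columns from each engine's results file
for path in sorted(glob.glob("./benchmarks/results/*_nightly_results.json")):
    with open(path) as f:
        df = pd.DataFrame.from_dict(json.load(f))
    token_cols = [c for c in df.columns if "token" in c.lower()]
    print(path)
    print(df[token_cols].describe() if token_cols else "  no token-like columns; inspect df.columns manually")
EOF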
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.