
[Performance]: reproducing vLLM performance benchmark

KuntaiDu opened this issue on Sep 5, 2024 · 6 comments

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

To reproduce vLLM's performance benchmark, launch a shell in the Docker image corresponding to each serving engine (an example docker run command follows the list):

  • SGLang: lmsysorg/sglang:v0.3.0-cu124
  • lmdeploy: openmmlab/lmdeploy:v0.6.0a0-cu12
  • TensorRT-LLM: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
  • vLLM: vllm/vllm-openai:v0.6.0
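
For example, a shell in the vLLM image can be started roughly as below. This is a sketch rather than part of the original instructions: it assumes the NVIDIA Container Toolkit is installed on the host, and flags such as --ipc=host, volume mounts for the Hugging Face cache, and the --entrypoint override may need adjusting for the other images:

# start an interactive shell in the vLLM image; the --entrypoint override is
# needed because vllm/vllm-openai otherwise launches the OpenAI-compatible server
docker run --gpus all --ipc=host -it --rm \
    --entrypoint /bin/bash \
    vllm/vllm-openai:v0.6.0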

Then run the following bash script (remember to replace <your HF TOKEN> with a Hugging Face token that has Llama-3 model access):

export HF_TOKEN=<your HF TOKEN>
apt update
apt install -y wget unzip 
# download benchmarking code
wget -O benchmarking_code.zip https://buildkite.com/organizations/vllm/pipelines/performance-benchmark/builds/8532/jobs/0191bbbf-c603-4c15-9f5d-e0b2933ba097/artifacts/0191bd2a-d6cd-4f6d-b618-a7aa2c39456c
unzip benchmarking_code.zip
# remove previous results
rm -r ./benchmarks/results
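# run the nightly benchmark suite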
VLLM_SOURCE_CODE_LOC=$(pwd) bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh

Your benchmarking results will be written to ./benchmarks/results with names of the form xxx_nightly_results.json; each file can be loaded and converted to a pandas DataFrame via pandas.DataFrame.from_dict(). Each benchmark run takes roughly 1 hour 10 minutes, assuming the model weights are already downloaded (and roughly 1 hour 30 minutes for TensorRT-LLM, since it also needs to convert the model into TensorRT engines for Triton Inference Server).
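
As a rough sketch (not part of the original instructions; the exact fields depend on what the benchmark scripts emit), the result files can be inspected like this:

# print each nightly results file as a pandas DataFrame
python3 -c "
import glob, json
import pandas as pd

for path in sorted(glob.glob('./benchmarks/results/*_nightly_results.json')):
    with open(path) as f:
        df = pd.DataFrame.from_dict(json.load(f))
    print(path)
    print(df)
"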

When you run the H100 benchmark inside the TensorRT-LLM docker container, you may hit a memory leak issue (issue link). In that case, add the following snippet

      # temporary fix for trt
      kill_gpu_processes
      bash -c "python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
              --world_size=${tp} \
              --model_repo=/tensorrtllm_backend/triton_model_repo & " </dev/null >/dev/null 2>&1 &
      wait_for_server

at line 211 (right after the for loop) in ./.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh, which forces the TensorRT-LLM Triton server to restart more often.
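
Since the exact line number may drift between script versions, it can help to print the surrounding lines first and confirm where the for loop ends before editing (a generic sanity check, not part of the original instructions):

# show the lines around the suggested insertion point before editing
sed -n '205,215p' ./.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh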

Known issue:

  • Across the serving engines, the number of output tokens does not strictly align (even after setting ignore_eos or max_length), due to imperfect implementations of these two flags in the different engines. That said, the number of tokens generated by vLLM is roughly aligned with the other engines, as all engines perform greedy sampling using the same model.

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
