[Benchmark] Benchmarks on different CUDA architectures with models of various sizes
Background
We noticed that most LLM inference engines disable sampling when reporting inference performance, whereas in real applications sampling is almost always enabled. To provide benchmarks that are as close to real-world usage as possible, we opened this issue to report LMDeploy's performance with sampling turned on.
Models under test
- llama2-7b
- llama2-13b
- internlm-20b
- llama2-70b
Test devices
- A100, compute precision: BF16 (FP16), W4A16, KV8 (see the loading sketch after this list)
- V100, compute precision: FP16
- 4090, compute precision: W4A16
- 3090, compute precision: W4A16
- 2080, compute precision: W4A16
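For reference, a minimal loading sketch, assuming LMDeploy's current `pipeline`/`TurbomindEngineConfig` API (not taken from this issue): the model path is illustrative, and the exact `quant_policy` value for an 8-bit KV cache has changed across LMDeploy versions, so treat both as assumptions.

```python
# Sketch: loading a W4A16 + KV8 configuration with LMDeploy's Python API.
from lmdeploy import TurbomindEngineConfig, pipeline

engine_cfg = TurbomindEngineConfig(
    model_format="awq",  # W4A16: 4-bit AWQ-quantized weights, FP16 activations
    quant_policy=8,      # 8-bit KV cache (KV8); value is version-dependent
    tp=1,                # tensor-parallel degree, the `tp` column below
)
# Illustrative model path; any AWQ-quantized llama2-7b checkpoint works.
pipe = pipeline("TheBloke/Llama-2-7B-Chat-AWQ", backend_config=engine_cfg)
print(pipe(["Hello!"])[0].text)
```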
Metrics
- Static inference performance (out token/s): the number of tokens generated per second, given a fixed batch size and fixed input/output token counts
- Requests processed per second (request/s): ShareGPT conversation data, with variable-length prompts and responses. We test two interfaces: the api_server's RESTful API, and the Python API on localhost (a simplified sketch of the RESTful measurement follows this list)
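The published numbers come from LMDeploy's own benchmark scripts; as a simplified stand-in, here is a rough sketch of how the RESTful request-throughput measurement could be driven, assuming the OpenAI-compatible route exposed by `lmdeploy serve api_server`. The host/port, served model name, prompt set, and concurrency level are illustrative assumptions.

```python
# Rough sketch: measure request/s against api_server with sampling enabled.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:23333/v1/chat/completions"  # assumed server address
PROMPTS = ["Hello!", "Explain the KV cache.", "Summarize LLM sampling."] * 100

def one_request(prompt: str) -> None:
    # Sampling parameters are set explicitly, matching the point of this
    # benchmark: performance is measured with sampling turned on.
    requests.post(URL, json={
        "model": "llama2",  # assumed served model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.8,
        "top_p": 0.95,
    }, timeout=300)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:  # 64 concurrent clients
    list(pool.map(one_request, PROMPTS))
elapsed = time.perf_counter() - start
print(f"request throughput: {len(PROMPTS) / elapsed:.3f} req/s")
```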
Doesn't sampling (with num_beam=1) seem to have little impact on performance?

My understanding is that it refers to settings like temperature, top_p, and top_k.

I tested llama-2-chat-7b (tp=1) with profile_throughput.py under different top_p, top_k, and temperature values; tokens/s showed almost no difference.
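That matches expectations: with num_beam=1, sampling only adds a cheap per-step transform over the logits (temperature scaling plus top-k/top-p filtering) relative to the model's matrix multiplications. A simplified version of such a sweep is sketched below, assuming LMDeploy's `pipeline`/`GenerationConfig` API; the model path, prompts, and the `generate_token_len` response field are assumptions for illustration.

```python
# Sketch: sweep sampling settings and compare output tokens/s.
import time

from lmdeploy import GenerationConfig, pipeline

pipe = pipeline("meta-llama/Llama-2-7b-chat-hf")  # assumed model path
prompts = ["Write a short story about a robot."] * 64

settings = [
    dict(temperature=1.0, top_k=1),               # effectively greedy
    dict(temperature=0.8, top_p=0.95, top_k=40),  # typical sampling
    dict(temperature=1.2, top_p=0.9, top_k=100),  # heavier sampling
]

for kwargs in settings:
    cfg = GenerationConfig(max_new_tokens=256, **kwargs)
    start = time.perf_counter()
    outputs = pipe(prompts, gen_config=cfg)
    elapsed = time.perf_counter() - start
    out_tokens = sum(o.generate_token_len for o in outputs)
    print(f"{kwargs} -> {out_tokens / elapsed:.1f} tok/s")
```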
A100 (W4A16)
Request Throughput (RPM)
(FTL = first-token latency; RPS/RPM = requests per second/minute.)
model | batch | tp | num_prompts | RPS | RPM | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) | throughput(out tok/s) | throughput(total tok/s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
llama2-7b | 64 | 1 | 3000 | 12.083 | 725.005 | 0.199 | 0.027 | 2.393 | 0.008 | 0.022 | 0.052 | 0.339 | 2811.948 | 5795.166 |
llama2-7b | 128 | 1 | 3000 | 13.375 | 802.511 | 0.341 | 0.052 | 4.029 | 0.022 | 0.046 | 0.098 | 0.380 | 3112.555 | 6414.690 |
llama2-13b | 64 | 1 | 3000 | 7.980 | 478.805 | 0.130 | 0.036 | 2.077 | 0.026 | 0.031 | 0.086 | 0.138 | 1857.054 | 3827.217 |
llama2-13b | 128 | 1 | 3000 | 8.370 | 502.200 | 0.385 | 0.069 | 4.405 | 0.051 | 0.071 | 0.146 | 0.212 | 1947.793 | 4014.223 |
internlm-20b | 64 | 1 | 3000 | 6.333 | 379.977 | 0.241 | 0.055 | 10.015 | 0.038 | 0.046 | 0.128 | 0.188 | 1263.609 | 2674.010 |
internlm-20b | 128 | 1 | 3000 | 6.310 | 378.589 | 2.236 | 0.083 | 9.626 | 0.067 | 0.094 | 0.204 | 0.289 | 1258.992 | 2664.239 |
llama2-70b | 64 | 4 | 3000 | 5.355 | 321.290 | 0.245 | 0.063 | 3.595 | 0.036 | 0.041 | 0.129 | 0.213 | 1246.131 | 2568.162 |
llama2-70b | 128 | 4 | 3000 | 6.484 | 389.064 | 0.455 | 0.078 | 6.471 | 0.058 | 0.075 | 0.196 | 0.280 | 1508.993 | 3109.897 |
Static Inference Performance
(The 50%/75%/95%/99% columns are per-token latency percentiles.)
llama2-7b
batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 128 | 260.80 | 67.77 | 0.004 | 0.004 | 0.005 | 0.004 | 0.004 | 0.004 | 0.004 |
1 | 1 | 128 | 128 | 245.91 | 67.77 | 0.013 | 0.012 | 0.014 | 0.004 | 0.004 | 0.004 | 0.005 |
1 | 1 | 128 | 2048 | 226.59 | 67.77 | 0.013 | 0.013 | 0.013 | 0.005 | 0.005 | 0.005 | 0.005 |
1 | 1 | 2048 | 128 | 159.96 | 67.99 | 0.196 | 0.13 | 0.516 | 0.005 | 0.005 | 0.005 | 0.005 |
1 | 1 | 2048 | 2048 | 197.86 | 67.99 | 0.131 | 0.13 | 0.132 | 0.005 | 0.005 | 0.005 | 0.005 |
16 | 1 | 1 | 128 | 3326.22 | 67.80 | 0.01 | 0.007 | 0.014 | 0.005 | 0.005 | 0.006 | 0.006 |
16 | 1 | 128 | 128 | 2491.98 | 67.99 | 0.108 | 0.012 | 0.145 | 0.005 | 0.006 | 0.006 | 0.008 |
16 | 1 | 128 | 2048 | 1583.80 | 67.99 | 0.1 | 0.015 | 0.144 | 0.01 | 0.013 | 0.015 | 0.016 |
16 | 1 | 2048 | 128 | 518.54 | 69.46 | 1.43 | 0.133 | 2.032 | 0.015 | 0.015 | 0.016 | 0.017 |
16 | 1 | 2048 | 2048 | 784.66 | 69.36 | 1.437 | 0.134 | 2.044 | 0.019 | 0.022 | 0.024 | 0.025 |
32 | 1 | 1 | 128 | 4841.70 | 67.83 | 0.014 | 0.008 | 0.025 | 0.006 | 0.007 | 0.008 | 0.011 |
32 | 1 | 128 | 128 | 3288.00 | 68.18 | 0.193 | 0.018 | 0.263 | 0.008 | 0.008 | 0.01 | 0.011 |
32 | 1 | 128 | 2048 | 1867.68 | 68.15 | 0.194 | 0.019 | 0.277 | 0.017 | 0.022 | 0.026 | 0.028 |
32 | 1 | 2048 | 128 | 548.20 | 69.49 | 1.878 | 0.134 | 4.079 | 0.027 | 0.028 | 0.029 | 0.912 |
32 | 1 | 2048 | 2048 | 837.42 | 69.49 | 1.807 | 0.132 | 4.083 | 0.036 | 0.041 | 0.045 | 0.047 |
64 | 1 | 1 | 128 | 6576.58 | 67.90 | 0.031 | 0.009 | 0.056 | 0.01 | 0.016 | 0.024 | 0.03 |
64 | 1 | 128 | 128 | 4098.99 | 68.52 | 0.377 | 0.015 | 0.531 | 0.013 | 0.018 | 0.027 | 0.037 |
64 | 1 | 128 | 2048 | 2093.60 | 69.11 | 0.417 | 0.02 | 0.737 | 0.029 | 0.038 | 0.046 | 0.049 |
64 | 1 | 2048 | 128 | 568.93 | 69.49 | 2.811 | 0.133 | 13.776 | 0.044 | 0.046 | 0.177 | 1.046 |
64 | 1 | 2048 | 2048 | 828.56 | 69.49 | 34.994 | 0.133 | 104.059 | 0.044 | 0.045 | 0.047 | 0.051 |
llama2-13b
batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 128 | 157.79 | 57.66 | 0.007 | 0.007 | 0.008 | 0.006 | 0.006 | 0.006 | 0.007 |
1 | 1 | 128 | 128 | 151.50 | 61.63 | 0.021 | 0.021 | 0.023 | 0.006 | 0.006 | 0.007 | 0.007 |
1 | 1 | 128 | 2048 | 140.05 | 59.16 | 0.022 | 0.021 | 0.022 | 0.007 | 0.007 | 0.008 | 0.008 |
1 | 1 | 2048 | 128 | 105.74 | 57.91 | 0.238 | 0.237 | 0.24 | 0.008 | 0.008 | 0.008 | 0.008 |
1 | 1 | 2048 | 2048 | 122.68 | 57.91 | 0.238 | 0.237 | 0.239 | 0.008 | 0.008 | 0.008 | 0.008 |
16 | 1 | 1 | 128 | 2051.60 | 57.66 | 0.015 | 0.01 | 0.025 | 0.008 | 0.008 | 0.009 | 0.009 |
16 | 1 | 128 | 128 | 1493.19 | 57.91 | 0.224 | 0.022 | 0.264 | 0.009 | 0.009 | 0.01 | 0.011 |
16 | 1 | 128 | 2048 | 999.76 | 57.91 | 0.198 | 0.022 | 0.281 | 0.016 | 0.02 | 0.023 | 0.024 |
16 | 1 | 2048 | 128 | 301.19 | 59.72 | 2.704 | 0.239 | 3.829 | 0.023 | 0.023 | 0.024 | 0.025 |
16 | 1 | 2048 | 2048 | 489.79 | 59.72 | 2.478 | 0.241 | 3.849 | 0.03 | 0.034 | 0.036 | 0.037 |
32 | 1 | 1 | 128 | 2993.08 | 57.69 | 0.02 | 0.013 | 0.031 | 0.01 | 0.011 | 0.013 | 0.014 |
32 | 1 | 128 | 128 | 1996.37 | 58.16 | 0.42 | 0.022 | 0.505 | 0.012 | 0.013 | 0.015 | 0.017 |
32 | 1 | 128 | 2048 | 1165.21 | 58.56 | 0.729 | 0.022 | 1.176 | 0.026 | 0.033 | 0.038 | 0.04 |
32 | 1 | 2048 | 128 | 310.99 | 59.78 | 3.512 | 0.24 | 12.731 | 0.038 | 0.039 | 0.041 | 1.004 |
32 | 1 | 2048 | 2048 | 478.93 | 60.82 | 32.547 | 0.235 | 90.296 | 0.037 | 0.038 | 0.04 | 0.041 |
64 | 1 | 1 | 128 | 4229.19 | 57.78 | 0.038 | 0.01 | 0.065 | 0.015 | 0.018 | 0.026 | 0.032 |
64 | 1 | 128 | 128 | 2500.53 | 58.53 | 0.684 | 0.029 | 0.967 | 0.018 | 0.02 | 0.024 | 0.038 |
64 | 1 | 128 | 2048 | 1182.01 | 59.59 | 6.725 | 0.028 | 52.618 | 0.038 | 0.041 | 0.044 | 0.054 |
64 | 1 | 2048 | 128 | 312.75 | 59.72 | 15.559 | 0.241 | 25.265 | 0.038 | 0.039 | 0.041 | 1.701 |
64 | 1 | 2048 | 2048 | 471.09 | 97.87 | 158.007 | 0.239 | 255.386 | 0.038 | 0.038 | 0.04 | 0.042 |
internlm-20b
batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 128 | 102.44 | 70.05 | 0.011 | 0.01 | 0.011 | 0.01 | 0.01 | 0.01 | 0.011 |
1 | 1 | 128 | 128 | 98.88 | 92.22 | 0.032 | 0.032 | 0.033 | 0.01 | 0.01 | 0.01 | 0.011 |
1 | 1 | 128 | 2048 | 91.28 | 342.14 | 0.032 | 0.032 | 0.033 | 0.011 | 0.011 | 0.012 | 0.012 |
1 | 1 | 2048 | 128 | 69.28 | 69.81 | 0.361 | 0.36 | 0.361 | 0.012 | 0.012 | 0.012 | 0.012 |
1 | 1 | 2048 | 2048 | 80.07 | 69.81 | 0.362 | 0.361 | 0.363 | 0.012 | 0.013 | 0.013 | 0.013 |
16 | 1 | 1 | 128 | 1330.03 | 69.63 | 0.021 | 0.011 | 0.03 | 0.012 | 0.012 | 0.013 | 0.014 |
16 | 1 | 128 | 128 | 979.30 | 69.84 | 0.33 | 0.032 | 0.399 | 0.013 | 0.014 | 0.015 | 0.016 |
16 | 1 | 128 | 2048 | 659.21 | 69.97 | 0.344 | 0.032 | 0.409 | 0.024 | 0.03 | 0.034 | 0.036 |
16 | 1 | 2048 | 128 | 199.12 | 73.31 | 4.307 | 0.364 | 5.812 | 0.035 | 0.035 | 0.036 | 0.037 |
16 | 1 | 2048 | 2048 | 308.87 | 73.47 | 5.686 | 0.363 | 42.356 | 0.042 | 0.044 | 0.045 | 0.046 |
32 | 1 | 1 | 128 | 1974.15 | 69.69 | 0.028 | 0.016 | 0.041 | 0.016 | 0.017 | 0.019 | 0.021 |
32 | 1 | 128 | 128 | 1309.96 | 70.13 | 0.559 | 0.035 | 0.771 | 0.018 | 0.02 | 0.022 | 0.026 |
32 | 1 | 128 | 2048 | 738.76 | 368.22 | 2.114 | 0.033 | 26.537 | 0.037 | 0.045 | 0.048 | 0.049 |
32 | 1 | 2048 | 128 | 200.29 | 73.59 | 10.016 | 0.363 | 17.883 | 0.046 | 0.047 | 0.049 | 0.429 |
32 | 1 | 2048 | 2048 | 306.08 | 73.56 | 88.279 | 0.362 | 173.383 | 0.044 | 0.045 | 0.047 | 0.05 |
64 | 1 | 1 | 128 | 2808.92 | 69.84 | 0.041 | 0.014 | 0.06 | 0.022 | 0.024 | 0.028 | 0.03 |
64 | 1 | 128 | 128 | 1651.45 | 70.38 | 1.082 | 0.04 | 1.479 | 0.027 | 0.029 | 0.033 | 0.037 |
64 | 1 | 128 | 2048 | 736.56 | 205.43 | 22.127 | 0.035 | 83.859 | 0.048 | 0.05 | 0.053 | 0.273 |
64 | 1 | 2048 | 128 | 199.68 | 73.88 | 29.365 | 0.359 | 36.276 | 0.047 | 0.047 | 0.049 | 0.427 |
64 | 1 | 2048 | 2048 | 305.56 | 73.81 | 283.211 | 0.362 | 391.207 | 0.044 | 0.045 | 0.047 | 0.048 |
llama2-70b
batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 4 | 1 | 128 | 72.79 | 74.98 | 0.016 | 0.014 | 0.017 | 0.014 | 0.014 | 0.014 | 0.015 |
1 | 4 | 128 | 128 | 70.26 | 74.98 | 0.047 | 0.047 | 0.048 | 0.014 | 0.014 | 0.014 | 0.014 |
1 | 4 | 128 | 2048 | 63.91 | 74.98 | 0.05 | 0.048 | 0.051 | 0.016 | 0.016 | 0.016 | 0.016 |
1 | 4 | 2048 | 128 | 52.13 | 75.07 | 0.367 | 0.366 | 0.368 | 0.016 | 0.016 | 0.016 | 0.017 |
1 | 4 | 2048 | 2048 | 60.90 | 75.07 | 0.369 | 0.368 | 0.372 | 0.016 | 0.016 | 0.016 | 0.016 |
16 | 4 | 1 | 128 | 959.05 | 75.01 | 0.034 | 0.021 | 0.048 | 0.016 | 0.017 | 0.018 | 0.018 |
16 | 4 | 128 | 128 | 796.94 | 75.07 | 0.312 | 0.05 | 0.435 | 0.017 | 0.017 | 0.018 | 0.019 |
16 | 4 | 128 | 2048 | 832.31 | 75.07 | 0.245 | 0.051 | 0.441 | 0.019 | 0.02 | 0.022 | 0.023 |
16 | 4 | 2048 | 128 | 240.39 | 75.70 | 3.965 | 0.372 | 5.618 | 0.022 | 0.023 | 0.023 | 0.025 |
16 | 4 | 2048 | 2048 | 617.35 | 75.71 | 3.428 | 0.372 | 5.703 | 0.023 | 0.024 | 0.025 | 0.026 |
32 | 4 | 1 | 128 | 1502.71 | 75.04 | 0.042 | 0.028 | 0.065 | 0.021 | 0.022 | 0.023 | 0.025 |
32 | 4 | 128 | 128 | 1162.02 | 75.20 | 0.493 | 0.065 | 0.775 | 0.021 | 0.022 | 0.024 | 0.052 |
32 | 4 | 128 | 2048 | 1249.91 | 75.20 | 0.486 | 0.062 | 0.771 | 0.025 | 0.027 | 0.03 | 0.031 |
32 | 4 | 2048 | 128 | 270.66 | 75.78 | 5.204 | 0.373 | 11.228 | 0.029 | 0.03 | 0.032 | 2.545 |
32 | 4 | 2048 | 2048 | 831.20 | 75.78 | 5.216 | 0.374 | 11.302 | 0.033 | 0.035 | 0.037 | 0.039 |
64 | 4 | 1 | 128 | 2063.85 | 75.10 | 0.072 | 0.032 | 0.238 | 0.03 | 0.032 | 0.035 | 0.038 |
64 | 4 | 128 | 128 | 1489.83 | 75.39 | 0.692 | 0.084 | 1.47 | 0.031 | 0.033 | 0.038 | 0.217 |
64 | 4 | 128 | 2048 | 1678.58 | 75.39 | 0.835 | 0.115 | 1.362 | 0.037 | 0.041 | 0.046 | 0.049 |
64 | 4 | 2048 | 128 | 287.97 | 75.79 | 6.458 | 0.444 | 22.085 | 0.044 | 0.047 | 0.405 | 2.864 |
64 | 4 | 2048 | 2048 | 1047.97 | 75.80 | 6.475 | 0.438 | 22.369 | 0.05 | 0.054 | 0.058 | 0.062 |
Question: how is this static batch tested? Now that continuous batching is supported, isn't the inference batch size determined by the available GPU memory?

"Static batch" here is a relative notion. Inference still uses continuous batching; it's just that for the vast majority of the run, the in-flight batch size equals the input batch size (the --concurrency argument), as the sketch below illustrates.
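A minimal sketch of that driving pattern, assuming nothing about the engine itself: a fixed pool of workers, each keeping exactly one request in flight, so the continuous-batching scheduler sees a nearly constant batch size. `run_inference` is a placeholder for the actual generate call.

```python
# Sketch: fixed-concurrency driver, i.e. the semantics of --concurrency.
import asyncio

CONCURRENCY = 64  # matches the batch column in the tables above

async def run_inference(prompt: str) -> None:
    # Placeholder: in the real benchmark this is a generate call that
    # streams tokens until the response finishes.
    await asyncio.sleep(0.01)

async def worker(queue: asyncio.Queue) -> None:
    while True:
        prompt = await queue.get()
        if prompt is None:  # sentinel: no work left
            return
        await run_inference(prompt)
        # The loop immediately pulls the next prompt, back-filling the batch.

async def main(prompts) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for p in prompts:
        queue.put_nowait(p)
    for _ in range(CONCURRENCY):
        queue.put_nowait(None)  # one sentinel per worker
    await asyncio.gather(*(worker(queue) for _ in range(CONCURRENCY)))

asyncio.run(main([f"prompt {i}" for i in range(1000)]))
```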
ref https://github.com/vllm-project/vllm/tree/main/.buildkite/nightly-benchmarks
latest benchmark results: https://buildkite.com/vllm/performance-benchmark/builds/3924
Maybe we could do something similar. cc @zhulinJulia24 @lvhan028