[Benchmark] Benchmarks on different CUDA architectures with models of various sizes
Background
We noticed that most LLM inference engines disable sampling when reporting inference performance, whereas in real applications sampling is almost always enabled. To provide benchmarks that are as close to real-world usage as possible, we opened this issue to report LMDeploy's performance with sampling turned on.
Models under test
- llama2-7b
- llama2-13b
- internlm-20b
- llama2-70b
Test devices
- A100, compute precision: BF16 (FP16), W4A16, KV8 (see the loading sketch after this list)
- V100, compute precision: FP16
- 4090, compute precision: W4A16
- 3090, compute precision: W4A16
- 2080, compute precision: W4A16
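For reference, a minimal loading sketch, assuming LMDeploy's current `pipeline`/`TurbomindEngineConfig` API (not taken from this issue): the model path is illustrative, and the exact `quant_policy` value for an 8-bit KV cache has changed across LMDeploy versions, so treat both as assumptions.

```python
# Sketch: loading a W4A16 + KV8 configuration with LMDeploy's Python API.
from lmdeploy import TurbomindEngineConfig, pipeline

engine_cfg = TurbomindEngineConfig(
    model_format="awq",  # W4A16: 4-bit AWQ-quantized weights, FP16 activations
    quant_policy=8,      # 8-bit KV cache (KV8); value is version-dependent
    tp=1,                # tensor-parallel degree, the `tp` column below
)
# Illustrative model path; any AWQ-quantized llama2-7b checkpoint works.
pipe = pipeline("TheBloke/Llama-2-7B-Chat-AWQ", backend_config=engine_cfg)
print(pipe(["Hello!"])[0].text)
```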
Metrics
- Static inference performance (out token/s): the number of tokens generated per second, given a fixed batch size and fixed input/output token counts
- Requests processed per second (request/s): ShareGPT conversation data, with variable-length prompts and responses. We test two interfaces: the api_server's RESTful API, and the Python API on localhost (a simplified sketch of the RESTful measurement follows this list)
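The published numbers come from LMDeploy's own benchmark scripts; as a simplified stand-in, here is a rough sketch of how the RESTful request-throughput measurement could be driven, assuming the OpenAI-compatible route exposed by `lmdeploy serve api_server`. The host/port, served model name, prompt set, and concurrency level are illustrative assumptions.

```python
# Rough sketch: measure request/s against api_server with sampling enabled.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:23333/v1/chat/completions"  # assumed server address
PROMPTS = ["Hello!", "Explain the KV cache.", "Summarize LLM sampling."] * 100

def one_request(prompt: str) -> None:
    # Sampling parameters are set explicitly, matching the point of this
    # benchmark: performance is measured with sampling turned on.
    requests.post(URL, json={
        "model": "llama2",  # assumed served model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.8,
        "top_p": 0.95,
    }, timeout=300)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:  # 64 concurrent clients
    list(pool.map(one_request, PROMPTS))
elapsed = time.perf_counter() - start
print(f"request throughput: {len(PROMPTS) / elapsed:.3f} req/s")
```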
Doesn't sampling (with num_beam=1) seem to have little impact on performance?

My understanding is that it refers to settings like temperature, top_p, and top_k.

I tested llama-2-chat-7b (tp=1) with profile_throughput.py under different top_p, top_k, and temperature values; tokens/s showed almost no difference.
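That matches expectations: with num_beam=1, sampling only adds a cheap per-step transform over the logits (temperature scaling plus top-k/top-p filtering) relative to the model's matrix multiplications. A simplified version of such a sweep is sketched below, assuming LMDeploy's `pipeline`/`GenerationConfig` API; the model path, prompts, and the `generate_token_len` response field are assumptions for illustration.

```python
# Sketch: sweep sampling settings and compare output tokens/s.
import time

from lmdeploy import GenerationConfig, pipeline

pipe = pipeline("meta-llama/Llama-2-7b-chat-hf")  # assumed model path
prompts = ["Write a short story about a robot."] * 64

settings = [
    dict(temperature=1.0, top_k=1),               # effectively greedy
    dict(temperature=0.8, top_p=0.95, top_k=40),  # typical sampling
    dict(temperature=1.2, top_p=0.9, top_k=100),  # heavier sampling
]

for kwargs in settings:
    cfg = GenerationConfig(max_new_tokens=256, **kwargs)
    start = time.perf_counter()
    outputs = pipe(prompts, gen_config=cfg)
    elapsed = time.perf_counter() - start
    out_tokens = sum(o.generate_token_len for o in outputs)
    print(f"{kwargs} -> {out_tokens / elapsed:.1f} tok/s")
```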
A100 (W4A16)
Request Throughput (RPM)
(FTL = first-token latency; RPS/RPM = requests per second/minute.)
model | batch | tp | num_prompts | RPS | RPM | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) | throughput(out tok/s) | throughput(total tok/s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
llama2-7b | 64 | 1 | 3000 | 12.083 | 725.005 | 0.199 | 0.027 | 2.393 | 0.008 | 0.022 | 0.052 | 0.339 | 2811.948 | 5795.166 |
llama2-7b | 128 | 1 | 3000 | 13.375 | 802.511 | 0.341 | 0.052 | 4.029 | 0.022 | 0.046 | 0.098 | 0.380 | 3112.555 | 6414.690 |
llama2-13b | 64 | 1 | 3000 | 7.980 | 478.805 | 0.130 | 0.036 | 2.077 | 0.026 | 0.031 | 0.086 | 0.138 | 1857.054 | 3827.217 |
llama2-13b | 128 | 1 | 3000 | 8.370 | 502.200 | 0.385 | 0.069 | 4.405 | 0.051 | 0.071 | 0.146 | 0.212 | 1947.793 | 4014.223 |
internlm-20b | 64 | 1 | 3000 | 6.333 | 379.977 | 0.241 | 0.055 | 10.015 | 0.038 | 0.046 | 0.128 | 0.188 | 1263.609 | 2674.010 |
internlm-20b | 128 | 1 | 3000 | 6.310 | 378.589 | 2.236 | 0.083 | 9.626 | 0.067 | 0.094 | 0.204 | 0.289 | 1258.992 | 2664.239 |
llama2-70b | 64 | 4 | 3000 | 5.355 | 321.290 | 0.245 | 0.063 | 3.595 | 0.036 | 0.041 | 0.129 | 0.213 | 1246.131 | 2568.162 |
llama2-70b | 128 | 4 | 3000 | 6.484 | 389.064 | 0.455 | 0.078 | 6.471 | 0.058 | 0.075 | 0.196 | 0.280 | 1508.993 | 3109.897 |
Static Inference Performance
(The 50%/75%/95%/99% columns are per-token latency percentiles.)
llama2-7b
batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 128 | 260.80 | 67.77 | 0.004 | 0.004 | 0.005 | 0.004 | 0.004 | 0.004 | 0.004 |
1 | 1 | 128 | 128 | 245.91 | 67.77 | 0.013 | 0.012 | 0.014 | 0.004 | 0.004 | 0.004 | 0.005 |
1 | 1 | 128 | 2048 | 226.59 | 67.77 | 0.013 | 0.013 | 0.013 | 0.005 | 0.005 | 0.005 | 0.005 |
1 | 1 | 2048 | 128 | 159.96 | 67.99 | 0.196 | 0.13 | 0.516 | 0.005 | 0.005 | 0.005 | 0.005 |
1 | 1 | 2048 | 2048 | 197.86 | 67.99 | 0.131 | 0.13 | 0.132 | 0.005 | 0.005 | 0.005 | 0.005 |
16 | 1 | 1 | 128 | 3326.22 | 67.80 | 0.01 | 0.007 | 0.014 | 0.005 | 0.005 | 0.006 | 0.006 |
16 | 1 | 128 | 128 | 2491.98 | 67.99 | 0.108 | 0.012 | 0.145 | 0.005 | 0.006 | 0.006 | 0.008 |
16 | 1 | 128 | 2048 | 1583.80 | 67.99 | 0.1 | 0.015 | 0.144 | 0.01 | 0.013 | 0.015 | 0.016 |
16 | 1 | 2048 | 128 | 518.54 | 69.46 | 1.43 | 0.133 | 2.032 | 0.015 | 0.015 | 0.016 | 0.017 |
16 | 1 | 2048 | 2048 | 784.66 | 69.36 | 1.437 | 0.134 | 2.044 | 0.019 | 0.022 | 0.024 | 0.025 |
32 | 1 | 1 | 128 | 4841.70 | 67.83 | 0.014 | 0.008 | 0.025 | 0.006 | 0.007 | 0.008 | 0.011 |
32 | 1 | 128 | 128 | 3288.00 | 68.18 | 0.193 | 0.018 | 0.263 | 0.008 | 0.008 | 0.01 | 0.011 |
32 | 1 | 128 | 2048 | 1867.68 | 68.15 | 0.194 | 0.019 | 0.277 | 0.017 | 0.022 | 0.026 | 0.028 |
32 | 1 | 2048 | 128 | 548.20 | 69.49 | 1.878 | 0.134 | 4.079 | 0.027 | 0.028 | 0.029 | 0.912 |
32 | 1 | 2048 | 2048 | 837.42 | 69.49 | 1.807 | 0.132 | 4.083 | 0.036 | 0.041 | 0.045 | 0.047 |
64 | 1 | 1 | 128 | 6576.58 | 67.90 | 0.031 | 0.009 | 0.056 | 0.01 | 0.016 | 0.024 | 0.03 |
64 | 1 | 128 | 128 | 4098.99 | 68.52 | 0.377 | 0.015 | 0.531 | 0.013 | 0.018 | 0.027 | 0.037 |
64 | 1 | 128 | 2048 | 2093.60 | 69.11 | 0.417 | 0.02 | 0.737 | 0.029 | 0.038 | 0.046 | 0.049 |
64 | 1 | 2048 | 128 | 568.93 | 69.49 | 2.811 | 0.133 | 13.776 | 0.044 | 0.046 | 0.177 | 1.046 |
64 | 1 | 2048 | 2048 | 828.56 | 69.49 | 34.994 | 0.133 | 104.059 | 0.044 | 0.045 | 0.047 | 0.051 |
llama2-13b
batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 128 | 157.79 | 57.66 | 0.007 | 0.007 | 0.008 | 0.006 | 0.006 | 0.006 | 0.007 |
1 | 1 | 128 | 128 | 151.50 | 61.63 | 0.021 | 0.021 | 0.023 | 0.006 | 0.006 | 0.007 | 0.007 |
1 | 1 | 128 | 2048 | 140.05 | 59.16 | 0.022 | 0.021 | 0.022 | 0.007 | 0.007 | 0.008 | 0.008 |
1 | 1 | 2048 | 128 | 105.74 | 57.91 | 0.238 | 0.237 | 0.24 | 0.008 | 0.008 | 0.008 | 0.008 |
1 | 1 | 2048 | 2048 | 122.68 | 57.91 | 0.238 | 0.237 | 0.239 | 0.008 | 0.008 | 0.008 | 0.008 |
16 | 1 | 1 | 128 | 2051.60 | 57.66 | 0.015 | 0.01 | 0.025 | 0.008 | 0.008 | 0.009 | 0.009 |
16 | 1 | 128 | 128 | 1493.19 | 57.91 | 0.224 | 0.022 | 0.264 | 0.009 | 0.009 | 0.01 | 0.011 |
16 | 1 | 128 | 2048 | 999.76 | 57.91 | 0.198 | 0.022 | 0.281 | 0.016 | 0.02 | 0.023 | 0.024 |
16 | 1 | 2048 | 128 | 301.19 | 59.72 | 2.704 | 0.239 | 3.829 | 0.023 | 0.023 | 0.024 | 0.025 |
16 | 1 | 2048 | 2048 | 489.79 | 59.72 | 2.478 | 0.241 | 3.849 | 0.03 | 0.034 | 0.036 | 0.037 |
32 | 1 | 1 | 128 | 2993.08 | 57.69 | 0.02 | 0.013 | 0.031 | 0.01 | 0.011 | 0.013 | 0.014 |
32 | 1 | 128 | 128 | 1996.37 | 58.16 | 0.42 | 0.022 | 0.505 | 0.012 | 0.013 | 0.015 | 0.017 |
32 | 1 | 128 | 2048 | 1165.21 | 58.56 | 0.729 | 0.022 | 1.176 | 0.026 | 0.033 | 0.038 | 0.04 |
32 | 1 | 2048 | 128 | 310.99 | 59.78 | 3.512 | 0.24 | 12.731 | 0.038 | 0.039 | 0.041 | 1.004 |
32 | 1 | 2048 | 2048 | 478.93 | 60.82 | 32.547 | 0.235 | 90.296 | 0.037 | 0.038 | 0.04 | 0.041 |
64 | 1 | 1 | 128 | 4229.19 | 57.78 | 0.038 | 0.01 | 0.065 | 0.015 | 0.018 | 0.026 | 0.032 |
64 | 1 | 128 | 128 | 2500.53 | 58.53 | 0.684 | 0.029 | 0.967 | 0.018 | 0.02 | 0.024 | 0.038 |
64 | 1 | 128 | 2048 | 1182.01 | 59.59 | 6.725 | 0.028 | 52.618 | 0.038 | 0.041 | 0.044 | 0.054 |
64 | 1 | 2048 | 128 | 312.75 | 59.72 | 15.559 | 0.241 | 25.265 | 0.038 | 0.039 | 0.041 | 1.701 |
64 | 1 | 2048 | 2048 | 471.09 | 97.87 | 158.007 | 0.239 | 255.386 | 0.038 | 0.038 | 0.04 | 0.042 |
internlm-20b
batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 128 | 102.44 | 70.05 | 0.011 | 0.01 | 0.011 | 0.01 | 0.01 | 0.01 | 0.011 |
1 | 1 | 128 | 128 | 98.88 | 92.22 | 0.032 | 0.032 | 0.033 | 0.01 | 0.01 | 0.01 | 0.011 |
1 | 1 | 128 | 2048 | 91.28 | 342.14 | 0.032 | 0.032 | 0.033 | 0.011 | 0.011 | 0.012 | 0.012 |
1 | 1 | 2048 | 128 | 69.28 | 69.81 | 0.361 | 0.36 | 0.361 | 0.012 | 0.012 | 0.012 | 0.012 |
1 | 1 | 2048 | 2048 | 80.07 | 69.81 | 0.362 | 0.361 | 0.363 | 0.012 | 0.013 | 0.013 | 0.013 |
16 | 1 | 1 | 128 | 1330.03 | 69.63 | 0.021 | 0.011 | 0.03 | 0.012 | 0.012 | 0.013 | 0.014 |
16 | 1 | 128 | 128 | 979.30 | 69.84 | 0.33 | 0.032 | 0.399 | 0.013 | 0.014 | 0.015 | 0.016 |
16 | 1 | 128 | 2048 | 659.21 | 69.97 | 0.344 | 0.032 | 0.409 | 0.024 | 0.03 | 0.034 | 0.036 |
16 | 1 | 2048 | 128 | 199.12 | 73.31 | 4.307 | 0.364 | 5.812 | 0.035 | 0.035 | 0.036 | 0.037 |
16 | 1 | 2048 | 2048 | 308.87 | 73.47 | 5.686 | 0.363 | 42.356 | 0.042 | 0.044 | 0.045 | 0.046 |
32 | 1 | 1 | 128 | 1974.15 | 69.69 | 0.028 | 0.016 | 0.041 | 0.016 | 0.017 | 0.019 | 0.021 |
32 | 1 | 128 | 128 | 1309.96 | 70.13 | 0.559 | 0.035 | 0.771 | 0.018 | 0.02 | 0.022 | 0.026 |
32 | 1 | 128 | 2048 | 738.76 | 368.22 | 2.114 | 0.033 | 26.537 | 0.037 | 0.045 | 0.048 | 0.049 |
32 | 1 | 2048 | 128 | 200.29 | 73.59 | 10.016 | 0.363 | 17.883 | 0.046 | 0.047 | 0.049 | 0.429 |
32 | 1 | 2048 | 2048 | 306.08 | 73.56 | 88.279 | 0.362 | 173.383 | 0.044 | 0.045 | 0.047 | 0.05 |
64 | 1 | 1 | 128 | 2808.92 | 69.84 | 0.041 | 0.014 | 0.06 | 0.022 | 0.024 | 0.028 | 0.03 |
64 | 1 | 128 | 128 | 1651.45 | 70.38 | 1.082 | 0.04 | 1.479 | 0.027 | 0.029 | 0.033 | 0.037 |
64 | 1 | 128 | 2048 | 736.56 | 205.43 | 22.127 | 0.035 | 83.859 | 0.048 | 0.05 | 0.053 | 0.273 |
64 | 1 | 2048 | 128 | 199.68 | 73.88 | 29.365 | 0.359 | 36.276 | 0.047 | 0.047 | 0.049 | 0.427 |
64 | 1 | 2048 | 2048 | 305.56 | 73.81 | 283.211 | 0.362 | 391.207 | 0.044 | 0.045 | 0.047 | 0.048 |
llama2-70b
batch | tp | prompt_tokens | completion_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 4 | 1 | 128 | 72.79 | 74.98 | 0.016 | 0.014 | 0.017 | 0.014 | 0.014 | 0.014 | 0.015 |
1 | 4 | 128 | 128 | 70.26 | 74.98 | 0.047 | 0.047 | 0.048 | 0.014 | 0.014 | 0.014 | 0.014 |
1 | 4 | 128 | 2048 | 63.91 | 74.98 | 0.05 | 0.048 | 0.051 | 0.016 | 0.016 | 0.016 | 0.016 |
1 | 4 | 2048 | 128 | 52.13 | 75.07 | 0.367 | 0.366 | 0.368 | 0.016 | 0.016 | 0.016 | 0.017 |
1 | 4 | 2048 | 2048 | 60.90 | 75.07 | 0.369 | 0.368 | 0.372 | 0.016 | 0.016 | 0.016 | 0.016 |
16 | 4 | 1 | 128 | 959.05 | 75.01 | 0.034 | 0.021 | 0.048 | 0.016 | 0.017 | 0.018 | 0.018 |
16 | 4 | 128 | 128 | 796.94 | 75.07 | 0.312 | 0.05 | 0.435 | 0.017 | 0.017 | 0.018 | 0.019 |
16 | 4 | 128 | 2048 | 832.31 | 75.07 | 0.245 | 0.051 | 0.441 | 0.019 | 0.02 | 0.022 | 0.023 |
16 | 4 | 2048 | 128 | 240.39 | 75.70 | 3.965 | 0.372 | 5.618 | 0.022 | 0.023 | 0.023 | 0.025 |
16 | 4 | 2048 | 2048 | 617.35 | 75.71 | 3.428 | 0.372 | 5.703 | 0.023 | 0.024 | 0.025 | 0.026 |
32 | 4 | 1 | 128 | 1502.71 | 75.04 | 0.042 | 0.028 | 0.065 | 0.021 | 0.022 | 0.023 | 0.025 |
32 | 4 | 128 | 128 | 1162.02 | 75.20 | 0.493 | 0.065 | 0.775 | 0.021 | 0.022 | 0.024 | 0.052 |
32 | 4 | 128 | 2048 | 1249.91 | 75.20 | 0.486 | 0.062 | 0.771 | 0.025 | 0.027 | 0.03 | 0.031 |
32 | 4 | 2048 | 128 | 270.66 | 75.78 | 5.204 | 0.373 | 11.228 | 0.029 | 0.03 | 0.032 | 2.545 |
32 | 4 | 2048 | 2048 | 831.20 | 75.78 | 5.216 | 0.374 | 11.302 | 0.033 | 0.035 | 0.037 | 0.039 |
64 | 4 | 1 | 128 | 2063.85 | 75.10 | 0.072 | 0.032 | 0.238 | 0.03 | 0.032 | 0.035 | 0.038 |
64 | 4 | 128 | 128 | 1489.83 | 75.39 | 0.692 | 0.084 | 1.47 | 0.031 | 0.033 | 0.038 | 0.217 |
64 | 4 | 128 | 2048 | 1678.58 | 75.39 | 0.835 | 0.115 | 1.362 | 0.037 | 0.041 | 0.046 | 0.049 |
64 | 4 | 2048 | 128 | 287.97 | 75.79 | 6.458 | 0.444 | 22.085 | 0.044 | 0.047 | 0.405 | 2.864 |
64 | 4 | 2048 | 2048 | 1047.97 | 75.80 | 6.475 | 0.438 | 22.369 | 0.05 | 0.054 | 0.058 | 0.062 |
Question: how is this static batch tested? Now that continuous batching is supported, isn't the inference batch size determined by the available GPU memory?

"Static batch" here is a relative notion. Inference still uses continuous batching; it's just that for the vast majority of the run, the in-flight batch size equals the input batch size (the --concurrency argument), as the sketch below illustrates.
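A minimal sketch of that driving pattern, assuming nothing about the engine itself: a fixed pool of workers, each keeping exactly one request in flight, so the continuous-batching scheduler sees a nearly constant batch size. `run_inference` is a placeholder for the actual generate call.

```python
# Sketch: fixed-concurrency driver, i.e. the semantics of --concurrency.
import asyncio

CONCURRENCY = 64  # matches the batch column in the tables above

async def run_inference(prompt: str) -> None:
    # Placeholder: in the real benchmark this is a generate call that
    # streams tokens until the response finishes.
    await asyncio.sleep(0.01)

async def worker(queue: asyncio.Queue) -> None:
    while True:
        prompt = await queue.get()
        if prompt is None:  # sentinel: no work left
            return
        await run_inference(prompt)
        # The loop immediately pulls the next prompt, back-filling the batch.

async def main(prompts) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for p in prompts:
        queue.put_nowait(p)
    for _ in range(CONCURRENCY):
        queue.put_nowait(None)  # one sentinel per worker
    await asyncio.gather(*(worker(queue) for _ in range(CONCURRENCY)))

asyncio.run(main([f"prompt {i}" for i in range(1000)]))
```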
ref https://github.com/vllm-project/vllm/tree/main/.buildkite/nightly-benchmarks
latest benchmark results: https://buildkite.com/vllm/performance-benchmark/builds/3924
Maybe we could do something similar. cc @zhulinJulia24 @lvhan028