[Bug] Why sglang is slower than vllm on ShareGPT datasets?
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [X] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [X] 5. Please use English, otherwise it will be closed.
Describe the bug
I compared the performance of vLLM and SGLang and found that vLLM slightly outperforms SGLang. Adding or removing `--disable-radix-cache` does not make a noticeable difference in the results.
Reproduction
[Scripts for serving the LLMs]

```python
import subprocess

# vLLM server (prefix caching enabled, served on port 8081)
subprocess.run(["python3", "-m", "vllm.entrypoints.openai.api_server",
                "--model", "/workspace/LLM-Research/gemma-2-2b-it/",
                "--port", "8081", "--enable-prefix-caching"])

# SGLang server (radix cache disabled, default port)
subprocess.run(["python3", "-m", "sglang.launch_server",
                "--model", "/workspace/LLM-Research/gemma-2-2b-it/",
                "--disable-radix-cache"])
```
[Scripts for benchmark]
```bash
python3 -m sglang.bench_serving --backend vllm --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompt 300 --request-rate-range 1,2,4,8,16,32 --random-input 1024 --random-output 1024 --multi > vllm_log_gemma_2b
python3 -m sglang.bench_serving --backend sglang --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompt 300 --request-rate-range 1,2,4,8,16,32 --random-input 1024 --random-output 1024 --multi > sglang_log_gemma_2b
```
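To compare the two runs side by side afterward, I use a small sketch like the one below. It assumes the redirected log files above (`vllm_log_gemma_2b`, `sglang_log_gemma_2b`) contain one JSON record per request rate, as in the results pasted later in this thread; the `load_records` helper is only illustrative and simply skips any non-JSON progress output.

```python
import json

def load_records(path):
    # Hypothetical helper: collect the per-request-rate JSON records from a
    # redirected bench_serving log, skipping anything that is not a JSON line.
    records = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line.startswith("{"):
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            if "request_rate" in rec:
                records[rec["request_rate"]] = rec
    return records

vllm = load_records("vllm_log_gemma_2b")
sglang = load_records("sglang_log_gemma_2b")

# Compare output throughput and median TTFT at each request rate.
for rate in sorted(vllm):
    v, s = vllm[rate], sglang[rate]
    print(f"rate={rate:>4}: throughput vllm={v['output_throughput']:8.1f} "
          f"sglang={s['output_throughput']:8.1f} tok/s | "
          f"median TTFT vllm={v['median_ttft_ms']:6.1f} ms "
          f"sglang={s['median_ttft_ms']:6.1f} ms")
```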
Environment
```
Python: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 4090
GPU 0 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.171.04
PyTorch: 2.4.0+cu121
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1

NVIDIA Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-31            0               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
```
Why vLLM --enable-prefix-caching but SGLang --disable-radix-cache?
I added it because I noticed the flag's description, "Disable RadixAttention for prefix caching." I experimented with both options, and the performance remained nearly the same either way. The sketch below shows the two matched configurations I compared.
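A minimal sketch of the matched configurations, assuming SGLang's radix cache is enabled by default (so omitting `--disable-radix-cache` is the cache-on case) and using the same flags as in the reproduction scripts above; each server is launched in its own run, they are only shown together here to contrast the flags:

```python
import subprocess

MODEL = "/workspace/LLM-Research/gemma-2-2b-it/"

# Cache ON for both engines:
#   vLLM with --enable-prefix-caching, SGLang with its default radix cache.
subprocess.run(["python3", "-m", "vllm.entrypoints.openai.api_server",
                "--model", MODEL, "--port", "8081", "--enable-prefix-caching"])
subprocess.run(["python3", "-m", "sglang.launch_server", "--model", MODEL])

# Cache OFF for both engines:
#   vLLM without --enable-prefix-caching, SGLang with --disable-radix-cache.
subprocess.run(["python3", "-m", "vllm.entrypoints.openai.api_server",
                "--model", MODEL, "--port", "8081"])
subprocess.run(["python3", "-m", "sglang.launch_server",
                "--model", MODEL, "--disable-radix-cache"])
```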
# H100 SXM 80G
# both disable prefix cache
{"backend": "sglang", "dataset_name": "sharegpt", "request_rate": 1, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 53379, "mean_e2e_latency_ms": 1060.0129053120813, "median_e2e_latency_ms": 631.0748402029276, "median_ttft_ms": 18.003160133957863, "median_itl_ms": 4.990246146917343, "output_throughput": 197.87412207948074, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 310.26290529966354, "completed": 300}
{"backend": "sglang", "dataset_name": "sharegpt", "request_rate": 2, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 54385, "mean_e2e_latency_ms": 1142.5293239826956, "median_e2e_latency_ms": 682.4049800634384, "median_ttft_ms": 17.289772629737854, "median_itl_ms": 5.3678154945373535, "output_throughput": 402.8349289397124, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 152.4023752398789, "completed": 300}
{"backend": "sglang", "dataset_name": "sharegpt", "request_rate": 4, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 53874, "mean_e2e_latency_ms": 1255.643160852293, "median_e2e_latency_ms": 766.4705812931061, "median_ttft_ms": 16.696477308869362, "median_itl_ms": 5.6404247879981995, "output_throughput": 801.4205244685216, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 76.60522550344467, "completed": 300}
{"backend": "sglang", "dataset_name": "sharegpt", "request_rate": 8, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 54300, "mean_e2e_latency_ms": 1492.2475426395733, "median_e2e_latency_ms": 920.020530000329, "median_ttft_ms": 16.93708449602127, "median_itl_ms":6.66210800409317, "output_throughput": 1489.3958356616095, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 41.22006959468126, "completed": 300}
{"backend": "sglang", "dataset_name": "sharegpt", "request_rate": 16, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 53956, "mean_e2e_latency_ms": 1968.9715463047226, "median_e2e_latency_ms": 1235.9060738235712, "median_ttft_ms": 19.653500989079475, "median_itl_ms": 8.145060390233994, "output_throughput": 2431.3891651449953, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 25.250174213200808, "completed": 300}
{"backend": "sglang", "dataset_name": "sharegpt", "request_rate": 32, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 53773, "mean_e2e_latency_ms": 3084.684201118847, "median_e2e_latency_ms": 1871.8808554112911, "median_ttft_ms": 25.87439864873886, "median_itl_ms": 13.38309422135353, "output_throughput": 3353.5147760573177, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 18.30706112831831, "completed": 300}
{"backend": "vllm", "dataset_name": "sharegpt", "request_rate": 1, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 54931, "mean_e2e_latency_ms": 1175.5820131177704, "median_e2e_latency_ms": 700.7365971803665, "median_ttft_ms": 30.978860333561897, "median_itl_ms":5.363507196307182, "output_throughput": 197.8353235443718, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 310.32375260442495, "completed": 300}
{"backend": "vllm", "dataset_name": "sharegpt", "request_rate": 2, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 53702, "mean_e2e_latency_ms": 1317.124318704009, "median_e2e_latency_ms": 778.35145406425, "median_ttft_ms": 28.458086773753166, "median_itl_ms": 5.9411972761154175, "output_throughput": 401.2044547893357, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 153.02173060923815, "completed": 300}
{"backend": "vllm", "dataset_name": "sharegpt", "request_rate": 4, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 54104, "mean_e2e_latency_ms": 1572.6086960608761, "median_e2e_latency_ms": 912.4196767807007, "median_ttft_ms": 26.888899505138397, "median_itl_ms":6.863143295049667, "output_throughput": 799.936574330024, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 76.7473346889019, "completed": 300}
{"backend": "vllm", "dataset_name": "sharegpt", "request_rate": 8, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 54025, "mean_e2e_latency_ms": 2128.538253928224, "median_e2e_latency_ms": 1306.1794489622116, "median_ttft_ms": 29.123559594154358, "median_itl_ms":8.62162932753563, "output_throughput": 1432.597216274242, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 42.85433428362012, "completed": 300}
{"backend": "vllm", "dataset_name": "sharegpt", "request_rate": 16, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 53782, "mean_e2e_latency_ms": 5813.860850247244, "median_e2e_latency_ms": 3644.37235891819, "median_ttft_ms": 42.83914528787136, "median_itl_ms": 27.088146656751633, "output_throughput": 1840.3513998881972, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 33.35938995331526, "completed": 300}
{"backend": "vllm", "dataset_name": "sharegpt", "request_rate": 32, "total_input_tokens": 74403, "total_output_tokens": 61393, "total_output_tokens_retokenized": 53791, "mean_e2e_latency_ms": 9850.973141404489, "median_e2e_latency_ms": 8765.44738188386, "median_ttft_ms": 54.08148281276226, "median_itl_ms": 42.32705757021904, "output_throughput": 1876.0618597534879, "sharegpt_output_len": null, "random_input_len": 1024, "random_output_len": 1024, "random_range_ratio": 0.0, "duration": 32.72440068051219, "completed": 300}
These are the results on the H100; SGLang leads vLLM across the board, and the gap widens at higher request rates (output throughput is roughly 1.3x vLLM's at request rate 16 and roughly 1.8x at 32).
I will now reproduce this on the 4090 and look into the cause there.
Thanks for the reply! I am wondering whether it may be caused by the GPU memory settings, since it throws an out-of-memory error if you don't cap memory usage, but I am sure that both vLLM and SGLang are using about 20 GB of memory.
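To rule that out, this is roughly how I would pin both servers to the same memory budget; a sketch assuming vLLM's `--gpu-memory-utilization` and SGLang's `--mem-fraction-static` flags in these versions, with 0.85 as an illustrative value for the 24 GB 4090, not a recommendation:

```python
import subprocess

MODEL = "/workspace/LLM-Research/gemma-2-2b-it/"

# Cap vLLM to ~85% of GPU memory (assumed flag in vLLM 0.5.5).
subprocess.run(["python3", "-m", "vllm.entrypoints.openai.api_server",
                "--model", MODEL, "--port", "8081",
                "--gpu-memory-utilization", "0.85"])

# Cap SGLang's static memory pool to the same fraction (assumed flag in this SGLang version).
subprocess.run(["python3", "-m", "sglang.launch_server",
                "--model", MODEL,
                "--mem-fraction-static", "0.85"])
```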
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.