[Bug] Llama3 70B A100 PCIE TP4 slow speed
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [X] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [X] 5. Please use English, otherwise it will be closed.
Describe the bug
When benchmarking with 1k ShareGPT prompts, no results can be obtained.
10 prompts run normally, but after changing to 1000 the benchmark keeps getting stuck.
Initial test run completed. Starting main benchmark run...
0%| | 0/1000 [00:00<?, ?it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 10
Benchmark duration (s): 42.87
Total input tokens: 1369
Total generated tokens: 2278
Total generated tokens (retokenized): 2268
Request throughput (req/s): 0.23
Input token throughput (tok/s): 31.93
Output token throughput (tok/s): 53.14
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 16760.78
Median E2E Latency (ms): 11625.66
---------------Time to First Token----------------
Mean TTFT (ms): 4175.83
Median TTFT (ms): 4582.61
P99 TTFT (ms): 4774.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 66.01
Median TPOT (ms): 64.35
P99 TPOT (ms): 100.07
---------------Inter-token Latency----------------
Mean ITL (ms): 55.78
Median ITL (ms): 50.80
P99 ITL (ms): 106.39
==================================================
Reproduction
# server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --tp 4 --disable-radix-cache --enable-p2p-check
# client
python -m sglang.bench_serving --backend sglang --num-prompts 10
python -m sglang.bench_serving --backend sglang --num-prompts 1000
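As a side note, the repro passes `--enable-p2p-check` to the server. If you want to inspect GPU peer-to-peer accessibility directly, here is a minimal diagnostic sketch using PyTorch's `torch.cuda.can_device_access_peer`; it is not part of sglang, just a quick check idea:

```python
# Sketch: report CUDA peer-to-peer access for every GPU pair.
# Diagnostic only; assumes the 4 GPUs from the repro are visible.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'available' if ok else 'unavailable'}")
```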
Environment
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A100 80GB PCIe
GPU 0,1,2,3 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 545.23.08
PyTorch: 2.4.0+cu121
sglang: 0.2.13
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.3
aiohttp: 3.10.3
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 23.2
PIL: 10.2.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 24.0.1
vllm: 0.5.4
multipart: 0.0.9
openai: 1.41.0
anthropic: 0.34.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB PHB PHB 0-251 0 N/A
GPU1 PHB X PHB PHB 0-251 0 N/A
GPU2 PHB PHB X PHB 0-251 0 N/A
GPU3 PHB PHB PHB X 0-251 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ulimit soft: 1048576
Update: it does eventually run rather than hanging outright, but the speed is incredibly slow.
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 1472.60
Total input tokens: 215196
Total generated tokens: 198343
Total generated tokens (retokenized): 197285
Request throughput (req/s): 0.68
Input token throughput (tok/s): 146.13
Output token throughput (tok/s): 134.69
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 1086277.13
Median E2E Latency (ms): 1086581.10
---------------Time to First Token----------------
Mean TTFT (ms): 463420.11
Median TTFT (ms): 490054.93
P99 TTFT (ms): 763179.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 10909.16
Median TPOT (ms): 3128.24
P99 TPOT (ms): 120089.61
---------------Inter-token Latency----------------
Mean ITL (ms): 3173.50
Median ITL (ms): 1764.77
P99 ITL (ms): 3636.58
==================================================
It's possible that the server is running under some kind of hypervisor, which makes the links between the GPUs very slow and can seriously hurt performance. I hit a similar situation using multiple L40S GPUs with vllm on a server that uses KVM. Hope this helps.
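One quick way to test this hypothesis is to measure raw GPU-to-GPU copy bandwidth. A minimal sketch with PyTorch (the device pair and the 1 GiB payload are arbitrary choices; a healthy PCIe Gen4 x16 link should reach roughly 20+ GB/s, while a virtualized link without working P2P can be far slower):

```python
# Sketch: time repeated device-to-device copies between two GPUs.
# Assumes at least two visible CUDA devices.
import time
import torch

src, dst = torch.device("cuda:0"), torch.device("cuda:1")
n_bytes = 1 << 30  # 1 GiB payload
x = torch.empty(n_bytes, dtype=torch.uint8, device=src)
y = torch.empty(n_bytes, dtype=torch.uint8, device=dst)

for _ in range(3):  # warm-up copies
    y.copy_(x)
torch.cuda.synchronize(src)
torch.cuda.synchronize(dst)

iters = 10
t0 = time.perf_counter()
for _ in range(iters):
    y.copy_(x)
torch.cuda.synchronize(src)
torch.cuda.synchronize(dst)
dt = time.perf_counter() - t0
print(f"cuda:0 -> cuda:1 copy bandwidth: {n_bytes * iters / dt / 1e9:.1f} GB/s")
```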
@billvsme Thanks for the info. What parameters did you adjust to solve this problem?
Switching to a machine that doesn't use KVM brings the speed back to normal.
Note: when I changed machines, the reported topology went from PHB -> SYS (visible in the `nvidia-smi topo -m` output).
Thanks!