TTFT latency for long context (16K) is very high, around 15 seconds, for the Llama 3.1 70B model (same as or worse than vLLM)
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I am experimenting with SGLang and vLLM for a long-context (16K) RAG application that requires real-time responses. I am using a single NVIDIA A6000 48GB GPU and the Llama 3.1 70B AWQ 4-bit model.
Currently I am seeing a time-to-first-token (TTFT) latency of around 15 seconds, which is very high. I have experimented with parameters like --chunked-prefill-size, --mem-fraction-static, etc.
Can you please suggest which parameters I should mainly focus on to get optimal TTFT for long context?
Reproduction
N/A
Environment
N/A
SGLang currently mostly accelerates the decoding, so what you observed is expected. We are working on multiple optimizations that can accelerate prefill for long context workloads. Some of them should be ready soon. We will let you know when it is ready!
Hi @gkiri, could you provide a script for me to reproduce the situation you mentioned? Thanks.
python3 -m sglang.launch_server --model-path hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --dtype half --trust-remote-code --quantization marlin --enable-p2p-check --efficient-weight-load --host 0.0.0.0 --mem-fraction-static 0.875 --disable-cuda-graph --max-running-requests 5 --port 30000 --context-length 16000
Please use an input query longer than 10K tokens to observe the high TTFT latency.
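As a rough sketch (not my actual client), something like the following can send a >10K-token prompt to the OpenAI-compatible endpoint exposed by the launch command above and time the first streamed token. The use of the `openai` Python client, the `api_key="EMPTY"` placeholder, the prompt padding, and `max_tokens` are assumptions for illustration; only the model path and port 30000 come from the command above.

```python
import time
from openai import OpenAI

# Points at the OpenAI-compatible endpoint of the SGLang server launched above.
# The api_key is unused by the local server; "EMPTY" is a placeholder.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Pad the prompt to roughly 10K+ tokens (the repeat count is tokenizer-dependent;
# adjust as needed). In the real application this would be retrieved RAG context.
long_context = "SGLang long-context TTFT test sentence. " * 1200

start = time.perf_counter()
stream = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user",
               "content": long_context + "\n\nSummarize the context in one sentence."}],
    max_tokens=64,
    stream=True,
)

ttft = None
for chunk in stream:
    if ttft is None:
        # Time to the first streamed chunk approximates TTFT.
        ttft = time.perf_counter() - start
print(f"TTFT: {ttft:.2f} s")
```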
@gkiri Thanks! Could you provide a simple client demo?
@gkiri Can you also provide the vLLM command line for your test?
python -m vllm.entrypoints.openai.api_server --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --max_model_len 16000 --gpu-memory-utilization 0.98 --dtype=half --enforce-eager --quantization awq --swap-space 4 --disable-log-requests --trust-remote-code --enable-prefix-caching --use-v2-block-manager
> We are working on multiple optimizations that can accelerate prefill for long context workloads.
@Ying1123 Could you share a bit about how you intend to improve prefill, or even chunked-prefill, performance?
Is one approach, for instance, using FA3?
We will not share many details for now. I suggest you switch the LLM inference engine to SGLang and stay tuned.
> SGLang currently mostly accelerates the decoding, so what you observed is expected. We are working on multiple optimizations that can accelerate prefill for long context workloads. Some of them should be ready soon. We will let you know when it is ready!
I was reading the paper and wanted to dig deeper into the roots of SGLang's performance superiority over vLLM. The paper (in Section 6.2) mentions that the prefill time is expected to improve. Or maybe I'm missing something here. Could you please elaborate a little?
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
TTFT latency for long context (10K) with SGLang 0.3.5 on a Radeon RX 7900 XTX:
- batch_size=8: mean TTFT 3496.12 ms
- batch_size=16: mean TTFT 12495.6 ms
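For reference, a minimal sketch of how a mean-TTFT number like the ones above could be collected: fire `batch_size` streaming requests concurrently and average the time to each request's first chunk. This is illustrative only; the endpoint, model name, prompt length, and use of the `openai` client are assumptions, not the exact benchmark setup used here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
prompt = "Long-context TTFT benchmark sentence. " * 1200  # ~10K tokens, tokenizer-dependent

def one_request() -> float:
    """Send one streaming request and return its TTFT in seconds."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=32,
        stream=True,
    )
    ttft = None
    for _ in stream:  # record the first chunk, then drain the rest of the stream
        if ttft is None:
            ttft = time.perf_counter() - start
    return ttft

for batch_size in (8, 16):
    # Launch batch_size concurrent streaming requests and average their TTFTs.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        ttfts = list(pool.map(lambda _: one_request(), range(batch_size)))
    print(f"batch_size={batch_size}, mean TTFT: {sum(ttfts) / len(ttfts) * 1000:.1f} ms")
```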