
TTFT latency for long context (16K) is very high, around 15 seconds, for the Llama 3.1 70B model (same or worse than vLLM)

Open gkiri opened this issue 1 year ago • 10 comments

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I am experimenting with SGLang and vLLM for a long-context (16K) RAG application that requires real-time responses. I am using a single NVIDIA A6000 48GB GPU and the Llama 3.1 70B AWQ 4-bit model.

Currently I am seeing a time-to-first-token latency of around 15 seconds, which is very high. I have experimented with parameters like --chunked-prefill-size, --mem-fraction-static, etc.

Can you please suggest which parameters I should mainly focus on to get optimal TTFT for long context?
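For concreteness, one way to measure the TTFT described above is to time the first streamed token from the server's OpenAI-compatible endpoint, which both SGLang and vLLM expose. A minimal sketch, assuming the server is reachable at http://localhost:30000/v1 and serves the model under the name used in the launch commands below (adjust both to your setup):

import time
from openai import OpenAI  # pip install openai

# Assumed local endpoint; both SGLang and vLLM expose an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

long_context = "..."  # paste the >10K-token RAG context here

start = time.perf_counter()
stream = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": long_context}],
    max_tokens=128,
    stream=True,  # stream so the first content chunk marks TTFT
)

first_token_at = None
for chunk in stream:
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()
        print(f"TTFT: {first_token_at - start:.2f} s")
print(f"End-to-end: {time.perf_counter() - start:.2f} s")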

Reproduction

na

Environment

na

gkiri avatar Aug 04 '24 23:08 gkiri

SGLang currently mostly accelerates the decoding, so what you observed is expected. We are working on multiple optimizations that can accelerate prefill for long context workloads. Some of them should be ready soon. We will let you know when it is ready!

Ying1123 avatar Aug 05 '24 03:08 Ying1123

Hi @gkiri, could you provide a script for me to reproduce the situation you mentioned? Thanks.

zhyncs avatar Aug 05 '24 04:08 zhyncs

python3 -m sglang.launch_server --model-path hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --dtype half --trust-remote-code --quantization marlin --enable-p2p-check --efficient-weight-load --host 0.0.0.0 --mem-fraction-static 0.875 --disable-cuda-graph --max-running-requests 5 --port 30000 --context-length 16000

gkiri avatar Aug 05 '24 13:08 gkiri

Please provide an input query longer than 10K tokens to observe the high TTFT latency.

gkiri avatar Aug 05 '24 13:08 gkiri

@gkiri Thanks! Could you provide a simple client demo?

zhyncs avatar Aug 05 '24 17:08 zhyncs
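A minimal client along these lines could serve as such a demo, assuming the server launched with the command above is listening on localhost:30000 and exposes SGLang's native /generate endpoint; the repeated filler paragraph is only a stand-in for a real >10K-token RAG context:

import time
import requests

# Synthetic prompt of roughly 10-12K tokens (filler text, not the real RAG context).
filler = ("SGLang is an inference engine for large language models. "
          "This sentence is repeated to pad the prompt to a long context. ") * 450
prompt = filler + "\n\nQuestion: Summarize the text above in one sentence.\nAnswer:"

payload = {
    "text": prompt,
    # A small max_new_tokens keeps the measurement dominated by prefill.
    "sampling_params": {"max_new_tokens": 64, "temperature": 0},
}

start = time.perf_counter()
# Assumed SGLang native endpoint; adjust host/port to match the launch command.
resp = requests.post("http://localhost:30000/generate", json=payload, timeout=600)
elapsed = time.perf_counter() - start

resp.raise_for_status()
print(f"End-to-end latency: {elapsed:.2f} s")
print(resp.json()["text"][:200])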

@gkiri Can you also provide the vLLM command line for your test?

min-xu-et avatar Aug 05 '24 19:08 min-xu-et

python -m vllm.entrypoints.openai.api_server --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --max_model_len 16000 --gpu-memory-utilization 0.98 --dtype=half --enforce-eager --quantization awq --swap-space 4 --disable-log-requests --trust-remote-code --enable-prefix-caching --use-v2-block-manager

gkiri avatar Aug 06 '24 12:08 gkiri

We are working on multiple optimizations that can accelerate prefill for long context workloads.

@Ying1123 Could you share a bit about how you intend to improve prefill, or even chunked-prefill, performance?

Is one of them, for instance, using FA3?

jon-chuang avatar Aug 08 '24 08:08 jon-chuang

We will not share many details for now. I suggest you switch the LLM inference engine to SGLang and stay tuned.

zhyncs avatar Aug 08 '24 08:08 zhyncs

SGLang currently mostly accelerates the decoding, so what you observed is expected. We are working on multiple optimizations that can accelerate prefill for long context workloads. Some of them should be ready soon. We will let you know when it is ready!

I was reading the paper and wanted to dig deeper into the roots of SGLang's performance superiority over vLLM. The paper (in Section 6.2) mentions that the prefill time is expected to improve. Or maybe I'm missing something here. Could you please elaborate a little bit?

mory91 avatar Aug 10 '24 00:08 mory91

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Oct 09 '24 01:10 github-actions[bot]

TTFT latency for long context (10K) on a Radeon RX 7900 XTX with sglang 0.3.5:
batch_size=8: mean TTFT 3496.12 ms
batch_size=16: mean TTFT 12495.6 ms

linqingxu avatar Nov 07 '24 06:11 linqingxu
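One way to reproduce a batched mean-TTFT number like the one above is to fire batch_size concurrent streaming requests and average each request's time to the first content chunk. A rough sketch, reusing the assumed OpenAI-compatible endpoint and model name from the earlier snippet (not necessarily how these numbers were measured):

import asyncio
import time
from openai import AsyncOpenAI  # pip install openai

# Assumed local endpoint and model name; adjust to the actual deployment.
client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
PROMPT = "..."  # ~10K-token context, e.g. built like the filler prompt above

async def one_request() -> float | None:
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=32,
        stream=True,
    )
    async for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
    return ttft

async def main(batch_size: int) -> None:
    results = await asyncio.gather(*(one_request() for _ in range(batch_size)))
    ttfts = [t for t in results if t is not None]
    print(f"batch_size={batch_size}, mean TTFT: {1000 * sum(ttfts) / len(ttfts):.1f} ms")

asyncio.run(main(8))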