Why is online serving slower than offline serving?
- offline serving (a minimal sketch is below)
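Roughly, my offline setup looks like this (a minimal sketch, not my exact script; the sampling parameters and prompt here are placeholders):

```python
# Minimal offline-serving sketch (placeholders, not my exact script).
from vllm import LLM, SamplingParams

llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Single request, same prompt as in the online test.
outputs = llm.generate(["<same prompt>"], sampling_params)
print(outputs[0].outputs[0].text)
```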
- online serving (FastAPI); the log from one request and a minimal sketch are below
log:
```
INFO 12-11 21:50:36 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 12-11 21:50:41 async_llm_engine.py:111] Finished request 261ddff3312f44cd8ee1c52a6acd10e6.
```
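And the FastAPI setup is roughly this (again a minimal sketch, not my actual server; the /generate endpoint name and the sampling parameters are placeholders):

```python
# Minimal online-serving sketch with FastAPI + AsyncLLMEngine
# (placeholders, not my actual server).
from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Open-Orca/Mistral-7B-OpenOrca")
)

@app.post("/generate")
async def generate(prompt: str) -> dict:
    # Same sampling parameters as the offline run.
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
    request_id = random_uuid()
    final_output = None
    # Consume the async stream and keep only the final output.
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return {"text": final_output.outputs[0].text}
```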
Why is generation about 2 seconds slower when it goes through FastAPI? The parameters are the same and the prompt is the same.
"Open-Orca/Mistral-7B-OpenOrca" this model same issue and any llama2 model same issue
python: 3.10.12
cuda_version: 12.0
gpu: A100 40G
My library list is attached (my library list.txt).