Why is online serving slower than offline serving?

BangDaeng opened this issue 1 year ago · 9 comments

  1. Offline serving: screenshot attached.

  2. Online serving (FastAPI): screenshots attached. Log:

     INFO 12-11 21:50:36 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
     INFO 12-11 21:50:41 async_llm_engine.py:111] Finished request 261ddff3312f44cd8ee1c52a6acd10e6.
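For reference, here is a minimal sketch of the two setups being compared. The model name matches the one mentioned below; the sampling parameter values are illustrative, not my exact ones.

```python
# Offline serving: synchronous generation with the LLM class in one process.
from vllm import LLM, SamplingParams

llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
params = SamplingParams(temperature=0.7, max_tokens=256)  # illustrative values
outputs = llm.generate(["my prompt"], params)
print(outputs[0].outputs[0].text)
```

The online path wraps AsyncLLMEngine in FastAPI, following the pattern of vLLM's example API server; the /generate endpoint name is my own choice:

```python
# Online serving: the same model behind a FastAPI endpoint via AsyncLLMEngine.
from fastapi import FastAPI
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Open-Orca/Mistral-7B-OpenOrca"))

@app.post("/generate")
async def generate(prompt: str):
    # Same sampling parameters as the offline run above.
    params = SamplingParams(temperature=0.7, max_tokens=256)
    request_id = random_uuid()
    final = None
    # engine.generate yields intermediate RequestOutputs; keep the last one.
    async for output in engine.generate(prompt, params, request_id):
        final = output
    return {"text": [o.text for o in final.outputs]}
```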

Why is the response about 2 seconds slower when served through FastAPI? The parameters are the same and the prompt is the same.
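To measure the gap I time one request end to end from a client. This snippet assumes the hypothetical /generate endpoint sketched above is running on localhost:8000 (the URL and port are assumptions); the offline number comes from timing llm.generate(...) directly in the same way.

```python
import time

import requests

start = time.perf_counter()
resp = requests.post("http://localhost:8000/generate",
                     params={"prompt": "my prompt"})
print(f"online latency: {time.perf_counter() - start:.2f}s, "
      f"status: {resp.status_code}")
```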

"Open-Orca/Mistral-7B-OpenOrca" this model same issue and any llama2 model same issue

Python: 3.10.12
CUDA version: 12.0
GPU: A100 40GB
My library list is attached (my library list.txt).

BangDaeng · Dec 11 '23