Why is online serving slower than offline serving?
- offline serving (a minimal sketch is below)
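Roughly, my offline setup looks like this (a minimal sketch, not my exact script; the sampling parameters and prompt here are placeholders):

```python
# Minimal offline-serving sketch (placeholders, not my exact script).
from vllm import LLM, SamplingParams

llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Single request, same prompt as in the online test.
outputs = llm.generate(["<same prompt>"], sampling_params)
print(outputs[0].outputs[0].text)
```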
- online serving (FastAPI); the log from one request and a minimal sketch are below
log:
```
INFO 12-11 21:50:36 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 12-11 21:50:41 async_llm_engine.py:111] Finished request 261ddff3312f44cd8ee1c52a6acd10e6.
```
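And the FastAPI setup is roughly this (again a minimal sketch, not my actual server; the /generate endpoint name and the sampling parameters are placeholders):

```python
# Minimal online-serving sketch with FastAPI + AsyncLLMEngine
# (placeholders, not my actual server).
from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Open-Orca/Mistral-7B-OpenOrca")
)

@app.post("/generate")
async def generate(prompt: str) -> dict:
    # Same sampling parameters as the offline run.
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
    request_id = random_uuid()
    final_output = None
    # Consume the async stream and keep only the final output.
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return {"text": final_output.outputs[0].text}
```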
Why is generation about 2 seconds slower when it goes through FastAPI? The parameters are the same and the prompt is the same.
"Open-Orca/Mistral-7B-OpenOrca" this model same issue and any llama2 model same issue
python: 3.10.12
cuda_version: 12.0
gpu: A100 40G
My library list is attached (my library list.txt).