[Performance]: the performance with chunked-prefill-enabled is lower than default
I tested vLLM's benchmarks/benchmark_throughput.py and found that performance with chunked prefill enabled is lower than the default. How can I deal with this problem?
Your current environment (if you think it is necessary)
export CUDA_VISIBLE_DEVICES=0
python3 ./benchmarks/benchmark_throughput.py \
--model /home/workspace/chatglm3-6b/ \
--tokenizer /home/workspace/chatglm3-6b/ \
--num-prompts 16 \
--input-len 1024 \
--output-len 256 \
--enable-chunked-prefill \
--trust-remote-code
Are the parameters set correctly?
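For context on how the flags interact: with chunked prefill, vLLM caps the number of tokens processed per scheduler step at --max-num-batched-tokens, so a long prompt may be prefilled over several steps instead of one. A minimal sketch of the resulting step count (the chunk-size values below are illustrative, not vLLM defaults):

```python
import math

def prefill_steps(prompt_len: int, max_num_batched_tokens: int) -> int:
    """Number of scheduler steps needed to prefill one prompt when
    each step is capped at max_num_batched_tokens."""
    return math.ceil(prompt_len / max_num_batched_tokens)

# The benchmark above uses --input-len 1024.
print(prefill_steps(1024, 512))   # prompt split across 2 steps
print(prefill_steps(1024, 2048))  # 1 step, same as an unchunked prefill
```

If the chunk budget is at least as large as the prompt, prefill completes in a single step and behaves like the default path.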
chunked_prefill_enable = False
INFO 09-01 12:46:11 async_llm_engine.py:268] 7cbe74f5c90c4a95954ae8b87d36a3c6 finished E2E: 0.29664182662963867, TTFT: 0.29621362686157227, TBT: 0.00042819976806640625, TIQ: 0.001392364501953125
INFO 09-01 12:46:15 async_llm_engine.py:268] 9bbc02b5dc904963a915612fc8951d0a finished E2E: 0.29630255699157715, TTFT: 0.2959132194519043, TBT: 0.00038933753967285156, TIQ: 0.0011632442474365234
chunked_prefill_enable = True
INFO 09-01 12:52:55 async_llm_engine.py:268] f4ce2ce1237146b79df1e698d6d70582 finished E2E: 0.3303070068359375, TTFT: 0.32995128631591797, TBT: 0.00035572052001953125, TIQ: 0.0012929439544677734
INFO 09-01 12:53:00 async_llm_engine.py:268] b03a99b525da4bfd8ef6ef1928030a6b finished E2E: 0.3486812114715576, TTFT: 0.3483591079711914, TBT: 0.00032210350036621094, TIQ: 0.0012426376342773438
With chunked prefill enabled, TTFT increases from ~296 ms to ~330 ms.
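One plausible explanation for the regression: each extra prefill step carries a fixed scheduling and kernel-launch overhead, so splitting the 1024-token prompt into chunks raises TTFT even though the per-token compute is unchanged. A toy model, with entirely made-up constants, just to show the shape of the effect:

```python
import math

def ttft_ms(prompt_len: int, chunk: int,
            per_token_ms: float = 0.28, per_step_ms: float = 15.0) -> float:
    """Toy TTFT model: linear per-token prefill cost plus a fixed cost
    per scheduler step. Constants are hypothetical, for illustration only."""
    steps = math.ceil(prompt_len / chunk)
    return prompt_len * per_token_ms + steps * per_step_ms

# More chunks means more per-step overhead before the first token appears.
print(ttft_ms(1024, 1024))  # single-step prefill
print(ttft_ms(1024, 512))   # chunked prefill: strictly higher TTFT
```

Under this reading, raising --max-num-batched-tokens (so the whole prompt fits in one chunk) should close most of the gap for an offline throughput benchmark; chunked prefill mainly pays off when decodes from other requests are interleaved with the prefill chunks, which this 16-prompt batch test does not exercise.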
me too!
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!