[Bug]: inter-token latency is lower than TPOT in serving benchmark result
Your current environment
v0.5.2. The vLLM environment is not the issue, so I will skip the collection process.
🐛 Describe the bug
I am running benchmark tests and noticed a potential problem: the inter-token latency is lower than TPOT. Since inter-token latency takes TTFT into consideration, it should be higher than TPOT, but the data shows a different result. I have not looked at the code yet; I will try to figure this out.
root@fb5250e2ae4c:/workspace# python3 vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Llama-2-7b-chat-hf --num-prompts 200 --endpoint /v1/completions --tokenizer meta-llama/Llama-2-7b-chat-hf --save-result 2>&1 | tee benchmark_serving.txt
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', best_of=1, use_beam_search=False, num_prompts=200, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True, metadata=None, result_dir=None, result_filename=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:12<00:00, 2.74it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 72.96
Total input tokens: 49490
Total generated tokens: 41078
Request throughput (req/s): 2.74
Input token throughput (tok/s): 678.34
Output token throughput (tok/s): 563.04
---------------Time to First Token----------------
Mean TTFT (ms): 3594.18
Median TTFT (ms): 3685.95
P99 TTFT (ms): 7361.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 186.90
Median TPOT (ms): 121.63
P99 TPOT (ms): 966.47
---------------Inter-token Latency----------------
Mean ITL (ms): 121.20
Median ITL (ms): 92.91
P99 ITL (ms): 310.89
==================================================
Observed similar results in my experiments. It seems like TPOT is calculated with the final "[Done]" latency included, whereas ITL does not include the final latency, as shown here. I would like some more explanation of the difference between these metrics.
I can confirm this in my experiments: especially for the marlin24 model, ITL is much lower than TPOT, while TPOT for marlin24 is much higher than for a normal GPTQ model with the Marlin kernel.
Check the code here: https://github.com/vllm-project/vllm/blob/a2469127db6144eedb38d0b505287c0044e4ce06/benchmarks/benchmark_serving.py#L271

The output length used in the TPOT calculation is based on the re-tokenized length of the generated text instead of the real number of output tokens from the model. If the output is malformed, the tokenized length can be significantly less than the real output token count; in the normal case they are close. If these two output lengths are the same, then TPOT is equal to ITL.
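To illustrate the point about the two output lengths, here is a rough sketch (hypothetical values, not the benchmark's actual code):

```python
# Hypothetical illustration: re-encoding the generated text can yield a
# different count than the number of tokens the model actually emitted,
# e.g. when a malformed output detokenizes poorly.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

generated_text = "Hello, how can I help you today?"  # assembled from the stream
num_streamed_tokens = 12  # made-up count of token events seen on the wire

retokenized_len = len(tokenizer(generated_text, add_special_tokens=False).input_ids)
if retokenized_len < num_streamed_tokens:
    # TPOT divides by the re-tokenized length, so it comes out larger than a
    # per-event measurement like ITL would suggest.
    print(f"undercounted: {retokenized_len} < {num_streamed_tokens}")
```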
> inter-token latency takes TTFT into consideration
@Jeffwan This is no longer the case and has been fixed by #7372.
The reason why we use a separate calculation for TPOT is that ITL is not always a reliable measure of actual decoding performance: for certain backends/mechanisms, multiple tokens can be bundled into one server-sent event. Therefore we use TPOT = (end-to-end latency - TTFT) / len(generated output token ids) as a proxy.
One other thing worth noting: TPOT is a per-request metric, while ITL is a per-SSE metric.
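To make the distinction concrete, here is a minimal sketch of the two calculations as described above (hypothetical function names, not the actual benchmark implementation):

```python
# Sketch only: mirrors the formula quoted above, not benchmark_serving.py.

def tpot(e2e_latency_s: float, ttft_s: float, num_output_tokens: int) -> float:
    # Per-request proxy: (end-to-end latency - TTFT) / len(output token ids).
    return (e2e_latency_s - ttft_s) / num_output_tokens

def itl_samples(sse_arrival_times_s: list[float]) -> list[float]:
    # Per-SSE samples: one gap per consecutive pair of server-sent events,
    # regardless of how many tokens each event carried.
    return [t1 - t0 for t0, t1 in zip(sse_arrival_times_s, sse_arrival_times_s[1:])]
```

If one SSE bundles several tokens, TPOT still spreads the elapsed time over every token, while ITL records a single large gap for that event.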
IMHO the way we are defining ITL here is not very useful and potentially confusing. I think we should report only TTFT and TPOT (elsewhere, ITL is used as a synonym for TPOT).

It's mostly irrelevant if we return n tokens per SSE, since if it takes e.g. 100 ms to return 5 tokens, you can just make use of one every 20 ms after you receive them. The initial cost of waiting for all 5 is already captured in the TTFT.
> IMHO the way we are defining ITL here is not very useful and potentially confusing. I think we should report only TTFT and TPOT (elsewhere, ITL is used as a synonym for TPOT).
>
> It's mostly irrelevant if we return n tokens per SSE, since if it takes e.g. 100 ms to return 5 tokens, you can just make use of one every 20 ms after you receive them. The initial cost of waiting for all 5 is already captured in the TTFT.
@njhill That's very true. @tlrmchlsmth and I agreed to add this metric to the benchmark back when vLLM strictly followed a one-token-per-SSE protocol. I do think ITL is still worth keeping if we go back to that protocol in the future, just so we can get a sense of what the distribution of all decoding operations looks like in a given setup.
@njhill, @ywang96 what do you think about renaming ITL (inter-packet latency?) and hiding it behind an option?
I think it's still a good QoS metric to keep track of: jittery output generation is a worse user experience than output tokens generated at a constant rate, and the fact that we currently return N tokens per multi-step rather than one token at a time is a tradeoff worth exposing in our benchmarking scripts. I agree that this is definitely confusing and less important than TPOT.
I understand that the current naming of ITL might be causing some confusion. However, interpreting ITL as inter-packet latency seems to contradict the problem reported here. If the ITL measured here represents inter-packet latency, then TPOT should always be less than or equal to ITL, with equality only in cases where single-step postprocessing is applied. This issue suggests the opposite: ITL is reported as smaller than TPOT, which indicates there may be a misunderstanding or an underlying issue in the benchmark script worth investigating further.
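For a concrete (hypothetical) illustration of that expectation: if a request decodes at a steady 50 ms per token but the server bundles two tokens per SSE, each inter-packet gap is about $2 \times 50 = 100$ ms, so the measured ITL would be roughly 100 ms while TPOT stays around 50 ms; the two can only be equal when every packet carries a single token.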
> ITL is reported as smaller than TPOT
@hyhuang00 Yeah, that's indeed a good point.
The only possibility I can think of for this is when the model doesn't generate anything except special tokens (EOS, for example), so the generated text is empty but there is still an ITL record for it. (Here `output[i].itl` is a `List[float]`.)
https://github.com/vllm-project/vllm/blob/d9cd78eb718c233ebc5b84377fc2226af7ef0fa2/benchmarks/benchmark_serving.py#L338-L341
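A rough sketch of that edge case (hypothetical structure, not the actual code at the link above):

```python
# Hypothetical illustration: the model emits only special tokens (e.g. EOS),
# so the detokenized text is empty even though per-event gaps were recorded.
from dataclasses import dataclass

@dataclass
class RequestOutput:
    generated_text: str
    itl: list[float]  # gaps between streamed events, in seconds

out = RequestOutput(generated_text="", itl=[0.09, 0.09])
# Re-tokenizing "" yields (near) zero output tokens, so this request adds
# little or nothing to the TPOT average, yet its small gaps still get pooled
# into the mean ITL, pulling ITL below TPOT.
```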
I think jitter is reasonable to measure/report somehow. But IMO it's only relevant when it's uneven: in the multi-step case where we get exactly N tokens per response, this shouldn't be considered jitter, since the extra delay is captured in TTFT, and you could just evenly space those N tokens over the time between responses (that could be done client-side too).
If we have some other metric for it, I think we should call it something completely different, like MTBR ("mean time between responses") or MTBOC ("mean time between output chunks"), to avoid confusion with other ITL/TPOT perf metrics.
Actually, maybe the variance rather than the mean would make more sense for this purpose...
I see what you meant now about it being captured in the TTFT; I didn't understand that before, but I agree it makes sense. BTW, isn't it partially captured in the TPOT as well, since you might wait an extra few steps after your final token?
I can throw up a PR to do the name change, and we can discuss further there. Sounds good?
Thanks @tlrmchlsmth
> isn't it partially captured in the TPOT as well, since you might wait an extra few steps after your final token?
Yes that's true, but I guess for larger numbers of tokens the amortized difference per token would be quite small.
@ywang96 I have noticed a significant difference between TPOT and ITL in my benchmarking. After reviewing the code, I believe the discrepancy primarily arises from how token latency is averaged across different requests.
Suppose we have two requests with end-to-end latencies $L_1$ and $L_2$ (ignoring TTFT), and the number of generated tokens for each request is $N_1$ and $N_2$, respectively. TPOT simply takes an arithmetic mean across requests:

$$\mathrm{TPOT} = \frac{\mathrm{TPOT}_1 + \mathrm{TPOT}_2}{2} = \frac{L_1 / N_1 + L_2 / N_2}{2}$$

On the other hand, ITL is calculated as a weighted average:

$$\mathrm{ITL} = \frac{L_1 + L_2}{N_1 + N_2}$$

Naturally, these two methods yield different results. They are equal only when $N_1 = N_2$ or $\mathrm{TPOT}_1 = \mathrm{TPOT}_2$.
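A quick numeric check with made-up values shows how far apart the two can be:

```python
# Two hypothetical requests: same latency, very different token counts.
L1, N1 = 10.0, 100  # 10 s for 100 tokens -> TPOT_1 = 100 ms
L2, N2 = 10.0, 10   # 10 s for 10 tokens  -> TPOT_2 = 1000 ms

tpot = (L1 / N1 + L2 / N2) / 2  # arithmetic mean of per-request TPOTs
itl = (L1 + L2) / (N1 + N2)     # pooled (weighted) average

print(f"TPOT = {tpot * 1000:.0f} ms")  # 550 ms
print(f"ITL  = {itl * 1000:.0f} ms")   # 182 ms
```

The slow, short request dominates the arithmetic mean, matching the pattern in the report above where mean TPOT (186.90 ms) exceeds mean ITL (121.20 ms).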
Thanks @mzmssg, I had also noticed this and have been meaning to open a PR to propose changing the TPOT calculation to a uniform average.
IMO we should do that, and also rename what we currently call ITL to something that reflects that it's actually the time between chunks.
@njhill - curious if this has been solved? Is there interest in fixing this?