
In-flight batching and mixed batch

huijjj opened this issue 1 year ago

According to the documentation, packed and mixed batching appears to be the default behavior of TensorRT-LLM.

So I conducted an experiment to see the effect of mixed batching by varying max_num_tokens across a few appropriate sizes. The rest of the setup was as follows (a hypothetical sketch of the workload follows the list):

  • benchmarked with 1024 samples
    • randomly generated fixed-length inputs of 1024 tokens
    • fixed-length outputs of 1024 tokens (enforced by ignoring the EOS token)
  • max_batch_size set to 256
  • Llama 3 8B model
  • request arrival rate set to infinite (all requests are sent at the start)
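
To make the setup concrete, here is a hypothetical sketch of how such a synthetic workload could be generated. This is not the actual script I used; the file name and request format are purely illustrative:

```python
# Hypothetical generator for the synthetic workload described above.
import json
import random

NUM_REQUESTS = 1024      # 1024 samples
INPUT_LEN = 1024         # fixed-length random input tokens
OUTPUT_LEN = 1024        # fixed-length output (EOS ignored at decode time)
VOCAB_SIZE = 128_256     # Llama 3 vocabulary size

random.seed(0)
requests = [
    {
        # Random token ids of fixed length; the contents do not matter for timing.
        "input_ids": [random.randrange(VOCAB_SIZE) for _ in range(INPUT_LEN)],
        # Decode exactly OUTPUT_LEN tokens by ignoring the EOS token.
        "output_len": OUTPUT_LEN,
    }
    for _ in range(NUM_REQUESTS)
]

with open("synthetic_dataset.json", "w") as f:
    json.dump(requests, f)
```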

Here are the results with max_num_tokens set to 2048, 2050, 2064, and 2176 respectively:

| max_num_tokens | avg TTFT (ms) | avg TPOT (ms) | avg token throughput (tokens/sec) |
|---------------:|--------------:|--------------:|-----------------------------------:|
| 2048           | 265435.6      | 67.22         | 3323.19                             |
| 2050           | 265873.7      | 67.00         | 3319.35                             |
| 2064           | 268012.8      | 67.49         | 3291.81                             |
| 2176           | 265234.8      | 66.75         | 3325.43                             |

I expected max_num_tokens = 2048 to batch only two context (summarization) phase sequences, whereas 2050, 2064, and 2176 would additionally batch generation-phase sequences, up to 2, 16, and 128 of them respectively (see the back-of-the-envelope sketch below). The results show only marginal changes and no clear evidence of mixed batching. Note that the TTFT numbers are not very meaningful here, since the benchmark was run with the request rate (queries per second) set to infinite.
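
To spell out that expectation, here is the back-of-the-envelope calculation behind those numbers, assuming each context-phase request contributes its full 1024 input tokens to the max_num_tokens budget while each generation-phase request contributes a single token per step:

```python
# Expected leftover token budget for generation-phase requests once two
# full 1024-token context (prefill) requests are scheduled in the batch.
INPUT_LEN = 1024
CONTEXT_REQUESTS = 2

for max_num_tokens in (2048, 2050, 2064, 2176):
    leftover = max_num_tokens - CONTEXT_REQUESTS * INPUT_LEN
    print(f"max_num_tokens={max_num_tokens}: "
          f"room for {leftover} generation-phase sequences")
# -> room for 0, 2, 16 and 128 generation-phase sequences respectively
```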

So:

  • Q1: Is mixed batching (batching sequences in the summarization phase and the generation phase together in a single batch) still enabled by default?
  • Q2: If so, why couldn't I see a difference, and how should I design an experiment to observe the impact of mixed batching?
  • Q3: If not, how can I enable it, and what was the reason for dropping it from the defaults?
  • Q4: Are there any materials, including code, for getting a better understanding of the batch manager (the request scheduler) in TensorRT-LLM? To me, nothing seems to be open.
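
For Q3, this is roughly where I would assume such a knob lives, if it exists. A minimal sketch based on my reading of the executor API; ExecutorConfig, BatchingType, SchedulerConfig, and CapacitySchedulerPolicy are names I am assuming from the C++ executor API and they may not match the Python bindings of every release:

```python
# Minimal sketch with assumed API names (not verified against my installed version).
import tensorrt_llm.bindings.executor as trtllm

config = trtllm.ExecutorConfig(
    batching_type=trtllm.BatchingType.INFLIGHT,         # vs. BatchingType.STATIC
    scheduler_config=trtllm.SchedulerConfig(
        trtllm.CapacitySchedulerPolicy.MAX_UTILIZATION  # vs. GUARANTEED_NO_EVICT
    ),
)

executor = trtllm.Executor(
    "./llama3-8b-engine",           # hypothetical engine directory
    trtllm.ModelType.DECODER_ONLY,
    config,
)
```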

huijjj · Oct 10 '24 06:10