feat: add chunked context/prefill runtime option to trtllm-serve
Based on the latest main branch as of 3/25/2025:
before:
-# trtllm-serve --host 0.0.0.0 --tokenizer Llama-3.1-8B-Instruct-FP8 engines/Llama-3.1-8B-Instruct-FP8
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM][INFO] Engine version 0.17.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
-[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8738 MiB
after:
+# trtllm-serve --host 0.0.0.0 --tokenizer Llama-3.1-8B-Instruct-FP8/ engines/Llama-3.1-8B-Instruct-FP8 --chunked_context
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM][INFO] Engine version 0.17.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
+[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 131072 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8738 MiB
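The practical effect shown by the log diff: without chunked context, maxInputLen is capped at min(maxSequenceLen - 1, maxNumTokens) = 8192, so longer prompts are rejected; with chunked context enabled, prompts up to maxSequenceLen - 1 = 131071 tokens are accepted and the prefill is processed in chunks. A hypothetical request exercising this (assuming the default port 8000 and the OpenAI-compatible /v1/completions endpoint; the model name and prompt are placeholders) might look like:
# curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Llama-3.1-8B-Instruct-FP8", "prompt": "<a prompt longer than 8192 tokens>", "max_tokens": 64}'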
Whether to expose more configuration options for trtllm-serve is still under discussion; we may want to handle these configurations in a similar way to trtllm-bench.
@kaiyux @LinPoly please help review this MR.
Thanks, June
@tsnyder-sps Thanks for the contribution. The branch seems to have diverged significantly from the main branch; can you please rebase it so that the diff page only contains your changes? Thanks.
@kaiyux All set, rebased onto the latest main commit with some required formatting changes.
IIUC, this option is not included in our YAML file options, so it basically LGTM. BTW, there is no sign-off in the commit message; is it necessary for external contributions? @kaiyux
The --extra_llm_api_options arg should be able to cover this, because enable_chunked_prefill is part of the LlmArgs class.
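For example, something like the following should work (the file name extra_options.yaml is arbitrary; the YAML key is the enable_chunked_prefill field of LlmArgs):
# cat > extra_options.yaml <<EOF
enable_chunked_prefill: true
EOF
# trtllm-serve --host 0.0.0.0 --tokenizer Llama-3.1-8B-Instruct-FP8 engines/Llama-3.1-8B-Instruct-FP8 --extra_llm_api_options extra_options.yaml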
@tsnyder-sps The bigger context is that we try not to introduce too many arguments to the trtllm-serve command, and we aim to make performance good by default. If customized arguments are really needed, our current suggestion is to pass those options through the --extra_llm_api_options argument. That said, chunked context should already be possible to enable that way. Does that satisfy your requirement here?
@LinPoly To answer your second question - yes, we require sign-off for everyone who contributes to the repo.
Please let me know if there are further questions, thanks.
@kaiyux Thanks for the info about the --extra_llm_api_options param. It seems that capability was added after this MR was opened, and it adds significantly more functionality, so I'll close this out. Thanks!