feat: add chunked context/prefill runtime option to trtllm-serve
Based on the latest main branch as of 3/25/2025:
before:
-# trtllm-serve --host 0.0.0.0 --tokenizer Llama-3.1-8B-Instruct-FP8 engines/Llama-3.1-8B-Instruct-FP8
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM][INFO] Engine version 0.17.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
-[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8738 MiB
after:
+# trtllm-serve --host 0.0.0.0 --tokenizer Llama-3.1-8B-Instruct-FP8/ engines/Llama-3.1-8B-Instruct-FP8 --chunked_context
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
[TensorRT-LLM][INFO] Engine version 0.17.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
+[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 131072 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 8738 MiB
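The practical effect shown by the log diff: without chunked context, maxInputLen is capped at min(maxSequenceLen - 1, maxNumTokens) = 8192, so longer prompts are rejected; with chunked context enabled, prompts up to maxSequenceLen - 1 = 131071 tokens are accepted and the prefill is processed in chunks. A hypothetical request exercising this (assuming the default port 8000 and the OpenAI-compatible /v1/completions endpoint; the model name and prompt are placeholders) might look like:
# curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Llama-3.1-8B-Instruct-FP8", "prompt": "<a prompt longer than 8192 tokens>", "max_tokens": 64}'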
Whether to expose more configuration options for trtllm-serve is still under discussion; we may want to handle these configurations in a similar way to trtllm-bench.
@kaiyux @LinPoly please help review this MR.
Thanks, June
@tsnyder-sps Thanks for the contribution. The branch seems to have diverged significantly from the main branch; can you please rebase it so that the diff page only contains your changes? Thanks.
@kaiyux All set, rebased onto the latest main commit with some required formatting changes.
IIUC, this option is not included in our YAML file options, so it basically LGTM. BTW, there is no sign-off in the commit message; is it necessary for external contributions? @kaiyux
The --extra_llm_api_options arg should be able to cover this, because enable_chunked_prefill is part of the LlmArgs class.
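For example, something like the following should work (the file name extra_options.yaml is arbitrary; the YAML key is the enable_chunked_prefill field of LlmArgs):
# cat > extra_options.yaml <<EOF
enable_chunked_prefill: true
EOF
# trtllm-serve --host 0.0.0.0 --tokenizer Llama-3.1-8B-Instruct-FP8 engines/Llama-3.1-8B-Instruct-FP8 --extra_llm_api_options extra_options.yaml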
@tsnyder-sps The bigger context is that we try not to introduce too many arguments to the trtllm-serve command, and we aim to make performance good by default. If customized arguments are really needed, our current suggestion is to pass those options through the --extra_llm_api_options argument. That said, chunked context should already be possible to enable that way. Does that satisfy your requirement here?
@LinPoly To answer your second question - yes, we require sign-off for everyone who contributes to the repo.
Please let me know if there are further questions, thanks.
@kaiyux Thanks for the info about the --extra_llm_api_options param. It seems that capability was added after this MR was opened, and it adds significantly more functionality, so I'll close this out. Thanks!