TensorRT-LLM compilation parameters overwritten
Description
When deploying Mistral 7B Instruct v0.2 on a SageMaker endpoint (ml.g5.12xlarge) using the TensorRT-LLM backend (just-in-time compilation), I noticed that some of the serving parameters get overwritten.
Specifically, I used the following serving properties:
- "SERVING_ENGINE": "MPI"
- "OPTION_TENSOR_PARALLEL_DEGREE": "1"
- "OPTION_MAX_ROLLING_BATCH_SIZE": "16"
- "OPTION_ROLLING_BATCH": "trtllm"
- "OPTION_MAX_INPUT_LEN": "2048"
- "OPTION_MAX_OUTPUT_LEN": "16"
- "OPTION_BATCH_SCHEDULER_POLICY": "max_utilization"
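For reference, a minimal deployment sketch with the SageMaker Python SDK; the execution role, endpoint name, and HF_MODEL_ID are placeholders I've filled in for illustration, not values from the original setup:

```python
from sagemaker.model import Model

role = "<your-sagemaker-execution-role>"  # placeholder
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",  # assumed model id
        "SERVING_ENGINE": "MPI",
        "OPTION_TENSOR_PARALLEL_DEGREE": "1",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
        "OPTION_ROLLING_BATCH": "trtllm",
        "OPTION_MAX_INPUT_LEN": "2048",
        "OPTION_MAX_OUTPUT_LEN": "16",
        "OPTION_BATCH_SCHEDULER_POLICY": "max_utilization",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="mistral-7b-trtllm",  # placeholder name
)
```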
CloudWatch logs state the following:
- max_input_len is 2048 is larger than max_seq_len 16, clipping it to max_seq_len
- max_num_tokens (256) shouldn't be greater than max_seq_len * max_batch_size (256), specifying to max_seq_len * max_batch_size (256).
The documentation lists max_num_tokens as defaulting to 16384.
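The logged values are consistent with max_seq_len being derived from max_output_len alone (16) rather than from max_input_len + max_output_len. A sketch of that apparent derivation; this is my reconstruction from the log lines, not the actual DJL/TensorRT-LLM code:

```python
# Reconstruction of the observed values; the derivation itself is an
# assumption inferred from the warnings, not confirmed compiler logic.
max_input_len = 2048   # OPTION_MAX_INPUT_LEN
max_output_len = 16    # OPTION_MAX_OUTPUT_LEN
max_batch_size = 16    # OPTION_MAX_ROLLING_BATCH_SIZE

max_seq_len = max_output_len                      # 16 -- appears to ignore max_input_len
max_input_len = min(max_input_len, max_seq_len)   # clipped: 2048 -> 16
cap = max_seq_len * max_batch_size                # 16 * 16 = 256
max_num_tokens = min(16384, cap)                  # 256, despite the documented 16384 default
```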
Expected Behavior
Parameters should preserve their supplied values.
Error Message
When submitting inference requests:
this model is compiled to take up to 16 tokens. But actual tokens is 987 > 16. Please set with option.max_input_len=987
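A request like the following triggers the error once the prompt exceeds 16 tokens; this is an illustrative boto3 sketch, and the endpoint name and payload shape are assumptions:

```python
# Illustrative invocation; endpoint name and prompt are placeholders.
import json
import boto3

client = boto3.client("sagemaker-runtime", region_name="us-west-2")
response = client.invoke_endpoint(
    EndpointName="mistral-7b-trtllm",  # placeholder
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "<a prompt of ~987 tokens>",  # anything longer than 16 tokens fails
        "parameters": {"max_new_tokens": 16},
    }),
)
print(response["Body"].read().decode())
```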
How to Reproduce?
Deploy the model with the serving properties listed in the Description (see the deployment sketch there), then submit an inference request whose prompt exceeds 16 tokens.
Environment Info
Docker image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124