
TensorRT-LLM compilation parameter overwrite


Description

When deploying Mistral 7B Instruct v0.2 to a SageMaker endpoint (ml.g5.12xlarge) using the TensorRT-LLM backend (just-in-time compilation), I noticed that some of the serving parameters get overwritten.

Specifically, I used the following serving properties (a deployment sketch follows the list):

  • SERVING_ENGINE: MPI
  • OPTION_TENSOR_PARALLEL_DEGREE: 1
  • OPTION_MAX_ROLLING_BATCH_SIZE: 16
  • OPTION_ROLLING_BATCH: trtllm
  • OPTION_MAX_INPUT_LEN: 2048
  • OPTION_MAX_OUTPUT_LEN: 16
  • OPTION_BATCH_SCHEDULER_POLICY: max_utilization
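For context, here is a minimal sketch of how these properties can be passed as container environment variables when deploying with the SageMaker Python SDK. The role ARN, model ID, and endpoint name are placeholders, not values from this report:

```python
from sagemaker.model import Model

# Placeholders: substitute your own execution role ARN.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "SERVING_ENGINE": "MPI",
        "OPTION_TENSOR_PARALLEL_DEGREE": "1",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
        "OPTION_ROLLING_BATCH": "trtllm",
        "OPTION_MAX_INPUT_LEN": "2048",
        "OPTION_MAX_OUTPUT_LEN": "16",
        "OPTION_BATCH_SCHEDULER_POLICY": "max_utilization",
        # Assumption: model weights are pulled from the Hugging Face Hub.
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="mistral-7b-trtllm",  # placeholder name
)
```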

The CloudWatch logs state the following:

  • max_input_len is 2048 is larger than max_seq_len 16, clipping it to max_seq_len
  • max_num_tokens (256) shouldn't be greater than max_seq_len * max_batch_size (256), specifying to max_seq_len * max_batch_size (256).

The documentation lists 16384 as the default value for max_num_tokens, yet the log shows it being capped at 256.
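The two log lines are consistent with max_seq_len being derived from the supplied max_output_len (16) rather than from max_input_len + max_output_len. Below is a hypothetical reconstruction of the clamping the logs imply, using the values from this report; it is illustrative, not actual DJL or TensorRT-LLM source code:

```python
# Hypothetical reconstruction of the clamping implied by the CloudWatch logs.
max_input_len = 2048   # supplied via OPTION_MAX_INPUT_LEN
max_output_len = 16    # supplied via OPTION_MAX_OUTPUT_LEN
max_batch_size = 16    # supplied via OPTION_MAX_ROLLING_BATCH_SIZE

# Assumption: max_seq_len ends up equal to max_output_len (16) instead of
# max_input_len + max_output_len (2064), matching the first log line.
max_seq_len = max_output_len

if max_input_len > max_seq_len:
    # "max_input_len is 2048 is larger than max_seq_len 16, clipping it"
    max_input_len = max_seq_len

# Second log line: max_num_tokens is capped at max_seq_len * max_batch_size.
cap = max_seq_len * max_batch_size  # 16 * 16 = 256
max_num_tokens = min(256, cap)      # 256, not the documented default of 16384

print(max_input_len, max_num_tokens)  # 16 256 — matches the logged behavior
```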

Expected Behavior

Parameters should preserve their supplied values.

Error Message

When submitting inference requests, the endpoint returns: "this model is compiled to take up to 16 tokens. But actual tokens is 987 > 16. Please set with option.max_input_len=987"
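For reference, a minimal sketch of the kind of request that triggers the error, using boto3 against the deployed endpoint; the endpoint name and prompt are placeholders:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")

# Placeholder prompt: any input longer than 16 tokens reproduces the error,
# since the engine was compiled with max_input_len clipped to 16.
body = {
    "inputs": "Summarize the following document: ...",
    "parameters": {"max_new_tokens": 16},
}

response = runtime.invoke_endpoint(
    EndpointName="mistral-7b-trtllm",  # placeholder name
    ContentType="application/json",
    Body=json.dumps(body),
)
print(response["Body"].read().decode())
```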

How to Reproduce?

Deploy Mistral 7B Instruct v0.2 with the serving properties listed in the Description, then send an inference request whose input exceeds 16 tokens.

Environment Info

Docker image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124


CoolFish88 · Oct 01 '24