
Whisper - Missing parameters for triton deployment using tensorrt_llm backend

Open eleapttn opened this issue 10 months ago • 1 comment

System Info

Hello,

I'm trying to deploy Whisper large-v3 with Triton and the tensorrtllm backend, following this readme: https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.16.0/docs/whisper.md

Context

  • hardware: L40S
  • version of tensorrtllm_backend: v0.16.0
  • checkpoint conversion done (success)
  • TensorRT-LLM engines building done (success)

Issues

However, I run into issues when trying to complete step 3 (Prepare Tritonserver configs), because some of the parameters needed to fill the config file with the following script are missing:

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${DECODER_ENGINE_PATH},encoder_engine_dir:${ENCODER_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},max_queue_size:${MAX_QUEUE_SIZE},enable_context_fmha_fp32_acc:${ENABLE_CONTEXT_FMHA_FP32_ACC},cross_kv_cache_fraction:${CROSS_KV_CACHE_FRACTION},encoder_input_features_data_type:TYPE_FP16
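For reference, the set of parameters the template actually expects can be listed from the template itself. A minimal sketch, assuming the placeholders in config.pbtxt follow the usual ${name} pattern that fill_template.py substitutes:

# List every ${...} placeholder in the config template, i.e. every parameter
# that needs a value on the fill_template.py command line.
grep -oE '\$[{][A-Za-z_][A-Za-z0-9_]*[}]' model_repo_whisper/tensorrt_llm/config.pbtxt | sort -u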

My questions are:

  • Why do we need a tensorrt_llm "model" to run the triton server for whisper_bls?
  • If it is required, how should these parameters be set for a Whisper model?

Thank you 🙂

Who can help?

@juney-nvidia

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

In the https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.16.0/docs/whisper.md, at step 3:

BACKEND=tensorrtllm
DECOUPLED_MODE=false
DECODER_ENGINE_PATH=${output_dir}/decoder
ENCODER_ENGINE_PATH=${output_dir}/encoder
MAX_TOKENS_IN_KV_CACHE=24000
BATCHING_STRATEGY=inflight_fused_batching
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
EXCLUDE_INPUT_IN_OUTPUT=True
TRITON_MAX_BATCH_SIZE=8
MAX_QUEUE_DELAY_MICROSECONDS=0
MAX_BEAM_WIDTH=1
MAX_QUEUE_SIZE="0"
ENABLE_KV_CACHE_REUSE=false
ENABLE_CHUNKED_CONTEXT=false
CROSS_KV_CACHE_FRACTION="0.5"
n_mels=128
zero_pad=false

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${DECODER_ENGINE_PATH},encoder_engine_dir:${ENCODER_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},max_queue_size:${MAX_QUEUE_SIZE},enable_context_fmha_fp32_acc:${ENABLE_CONTEXT_FMHA_FP32_ACC},cross_kv_cache_fraction:${CROSS_KV_CACHE_FRACTION},encoder_input_features_data_type:TYPE_FP16
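Note that in plain bash, any variable that was never defined (for example MAX_ATTENTION_WINDOW_SIZE above) silently expands to an empty string, so the command may still run and write a config with empty values. A sketch of a fail-fast variant, assuming the same shell session:

# Abort on the first unset variable instead of silently passing an empty value.
set -u
# Same command as above; "..." stands for the rest of the key:value list.
python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${DECODER_ENGINE_PATH},...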

Expected behavior

Variable not found when running the script:

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt ...

Or in tritonserver logs:

[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:337] Error parsing text-format inference.ModelConfig: 105:16: Expected integer or identifier, got: $
E0102 18:16:16.688605 46342 model_repository_manager.cc:1460] "Poll failed for model directory 'tensorrt_llm': failed to read text proto from /workspace/model_repo/l40s/openai_whisper-large-v3_int8/tensorrt_llm/config.pbtxt"
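For context, "Expected integer or identifier, got: $" at 105:16 suggests an unsubstituted ${...} placeholder is still present at line 105 of the generated config. A quick check, using the path from the log above:

# Print any placeholders fill_template.py did not replace, with line numbers
# (the parser error above points at line 105).
grep -n -F '${' /workspace/model_repo/l40s/openai_whisper-large-v3_int8/tensorrt_llm/config.pbtxt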

Actual behavior

Missing parameters to fill the config.pbtxt

Additional notes

I tried adding the following parameters, but other parameters are still missing (one way to list them is sketched below):

MAX_ATTENTION_WINDOW_SIZE=448
BATCH_SCHEDULER_POLICY=max_utilization
NORMALIZE_LOG_PROBS=false
GPU_DEVICE_IDS=""
DECODING_MODE=""
ENABLE_CONTEXT_FMHA_FP32_ACC=true
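One way to enumerate what is still missing, rather than adding variables one error at a time, is to compare the parameter names the template expects with the keys already passed to fill_template.py. A sketch, assuming the ${name} placeholder convention; the provided list below is truncated and would need every key from the command above:

# Parameter names the template expects a value for.
grep -oE '\$[{][A-Za-z_][A-Za-z0-9_]*[}]' model_repo_whisper/tensorrt_llm/config.pbtxt | tr -d '${}' | sort -u > expected.txt
# Keys already passed to fill_template.py (truncated; paste the full list).
printf '%s\n' triton_backend engine_dir encoder_engine_dir decoupled_mode | sort -u > provided.txt
# Names in expected.txt but not in provided.txt still need to be supplied.
comm -23 expected.txt provided.txt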

eleapttn · Jan 02 '25 18:01