Whisper - Missing parameters for triton deployment using tensorrt_llm backend
System Info
Hello,
I'm trying to deploy Whisper large-v3 with Triton and the tensorrtllm backend, following this README: https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.16.0/docs/whisper.md
Context
- hardware: L40S
- version of tensorrtllm_backend: v0.16.0
- checkpoint conversion done (success)
- TensorRT-LLM engine building done (success)
Issues
However, I run into issues at step 3 (Prepare Tritonserver configs): parameters are missing when filling in the config file with the following script:
python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${DECODER_ENGINE_PATH},encoder_engine_dir:${ENCODER_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},max_queue_size:${MAX_QUEUE_SIZE},enable_context_fmha_fp32_acc:${ENABLE_CONTEXT_FMHA_FP32_ACC},cross_kv_cache_fraction:${CROSS_KV_CACHE_FRACTION},encoder_input_features_data_type:TYPE_FP16
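As far as I can tell, fill_template.py simply replaces the ${...} placeholders in config.pbtxt with the key:value pairs passed on the command line, so any placeholder that never gets a value is left in the file as-is. To see what the template expects, I list its placeholders like this (just a quick check on my side; path as in the command above):
# List every ${...} placeholder in the tensorrt_llm config template,
# to compare against the keys passed to fill_template.py above.
grep -oE '\$\{[A-Za-z0-9_]+\}' model_repo_whisper/tensorrt_llm/config.pbtxt | sort -u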
My questions are:
- Why do we need a tensorrt_llm "model" to run the Triton server for whisper_bls?
- If it is required, how do we set these parameters for a Whisper model?
Thank you 🙂
Who can help?
@juney-nvidia
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
In the https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.16.0/docs/whisper.md, at step 3:
BACKEND=tensorrtllm
DECOUPLED_MODE=false
DECODER_ENGINE_PATH=${output_dir}/decoder
ENCODER_ENGINE_PATH=${output_dir}/encoder
MAX_TOKENS_IN_KV_CACHE=24000
BATCHING_STRATEGY=inflight_fused_batching
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
EXCLUDE_INPUT_IN_OUTPUT=True
TRITON_MAX_BATCH_SIZE=8
MAX_QUEUE_DELAY_MICROSECONDS=0
MAX_BEAM_WIDTH=1
MAX_QUEUE_SIZE="0"
ENABLE_KV_CACHE_REUSE=false
ENABLE_CHUNKED_CONTEXT=false
CROSS_KV_CACHE_FRACTION="0.5"
n_mels=128
zero_pad=false
python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${DECODER_ENGINE_PATH},encoder_engine_dir:${ENCODER_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},max_queue_size:${MAX_QUEUE_SIZE},enable_context_fmha_fp32_acc:${ENABLE_CONTEXT_FMHA_FP32_ACC},cross_kv_cache_fraction:${CROSS_KV_CACHE_FRACTION},encoder_input_features_data_type:TYPE_FP16
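As a quick diagnostic (my own addition, not part of the README), I check which variables the command references but the step 3 snippet above never defines:
# These six variables are used by the fill_template.py command above but are
# not set in the step 3 snippet, so they expand to empty strings in the command.
for v in MAX_ATTENTION_WINDOW_SIZE BATCH_SCHEDULER_POLICY NORMALIZE_LOG_PROBS \
         GPU_DEVICE_IDS DECODING_MODE ENABLE_CONTEXT_FMHA_FP32_ACC; do
  declare -p "$v" >/dev/null 2>&1 || echo "unset: $v"
done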
Expected behavior
Variable not found when running the script:
python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt ...
Or in tritonserver logs:
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:337] Error parsing text-format inference.ModelConfig: 105:16: Expected integer or identifier, got: $
E0102 18:16:16.688605 46342 model_repository_manager.cc:1460] "Poll failed for model directory 'tensorrt_llm': failed to read text proto from /workspace/model_repo/l40s/openai_whisper-large-v3_int8/tensorrt_llm/config.pbtxt"
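The parse error points at line 105 of the generated config, so I print that region to see which placeholder is still unfilled (path taken from the log above):
# Show the lines around the location the protobuf parser complains about.
sed -n '100,110p' /workspace/model_repo/l40s/openai_whisper-large-v3_int8/tensorrt_llm/config.pbtxt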
Actual behavior
Missing parameters to fill the config.pbtxt.
Additional notes
I tried adding the parameters below, but other parameters are still missing (see the check after the list):
MAX_ATTENTION_WINDOW_SIZE=448
BATCH_SCHEDULER_POLICY=max_utilization
NORMALIZE_LOG_PROBS=false
GPU_DEVICE_IDS=""
DECODING_MODE=""
ENABLE_CONTEXT_FMHA_FP32_ACC=true
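This is how I check which placeholders are still left after re-running fill_template.py with the variables above (same file as in the command; just my own sanity check):
# Print any ${...} placeholder that survived the template filling; these are the
# parameters I still don't know how to set for Whisper.
grep -nE '\$\{[A-Za-z0-9_]+\}' model_repo_whisper/tensorrt_llm/config.pbtxt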