launch_triton_server.py attempts to place two models on the same GPU instead of one model on two GPUs
Description
launch_triton_server.py attempts to place two model instances on the same GPU instead of one instance on each of two GPUs, causing an out-of-memory failure. It loads one instance normally, then attempts to load a second instance of the model onto GPU 0. This is visible via nvidia-smi (two snapshots a few seconds apart):
Every 0.1s: nvidia-smi fdb2adce5ce5: Tue May 21 23:16:39 2024
Tue May 21 23:16:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:00:09.0 Off | 0 |
| N/A 50C P0 87W / 300W | 73723MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:00:0A.0 Off | 0 |
| N/A 50C P0 79W / 300W | 491MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Every 0.1s: nvidia-smi fdb2adce5ce5: Tue May 21 23:16:43 2024
Tue May 21 23:16:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:00:09.0 Off | 0 |
| N/A 50C P0 91W / 300W | 78025MiB / 81920MiB | 48% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:00:0A.0 Off | 0 |
| N/A 50C P0 78W / 300W | 491MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Memory usage on GPU 0 reaches ~73 GB, holds there for a few seconds, then continues to grow, eventually exceeding 80 GB and promptly OOMing with:
Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMalloc(ptr, n): out of memory
Here is the full output from Triton (I included a printout of the actual executed command at the start):
debug output
['mpirun', '--allow-run-as-root', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/opt/tritonserver/inflight_batcher_llm', '--grpc-port=8001', '--http-port=8000', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix0_', ':']
root@fdb2adce5ce5:/workspace# I0521 23:22:30.566415 52252 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f9ade000000' with size 268435456
I0521 23:22:30.572862 52252 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0521 23:22:30.572876 52252 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
W0521 23:22:30.725716 52252 server.cc:251] failed to enable peer access for some device pairs
I0521 23:22:30.727930 52252 model_lifecycle.cc:469] loading: postprocessing:1
I0521 23:22:30.727964 52252 model_lifecycle.cc:469] loading: preprocessing:1
I0521 23:22:30.727991 52252 model_lifecycle.cc:469] loading: tensorrt_llm:1
I0521 23:22:30.728011 52252 model_lifecycle.cc:469] loading: tensorrt_llm_bls:1
I0521 23:22:30.767941 52252 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_1 (CPU device 0)
I0521 23:22:30.768430 52252 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0521 23:22:30.768768 52252 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_1 (CPU device 0)
I0521 23:22:30.770628 52252 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I0521 23:22:31.359947 52252 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_1 (CPU device 0)
I0521 23:22:31.360038 52252 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
I0521 23:22:32.304181 52252 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm_bls'
I0521 23:22:37.724164 52252 model_lifecycle.cc:835] successfully loaded 'postprocessing'
I0521 23:22:37.724444 52252 model_lifecycle.cc:835] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 33280
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 3937 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3991, GPU 4447 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 3993, GPU 4457 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3934, now: CPU 0, GPU 3934 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4034, GPU 7625 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 4034, GPU 7633 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3934 (MiB)
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 528640. Allocating 69289902080 bytes.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 260
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 33280
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 3937 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4064, GPU 77667 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 4064, GPU 77677 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3934, now: CPU 0, GPU 7868 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4064, GPU 80837 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 4064, GPU 80845 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 7868 (MiB)
E0521 23:22:49.848194 52252 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMalloc(ptr, n): out of memory (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:87)
1 0x7f9b2017c293 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 147
2 0x7f9aa86f1f1b tensorrt_llm::runtime::BufferManager::gpuSync(nvinfer1::Dims32, nvinfer1::DataType) + 379
3 0x7f9aa881a0d8 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::BlockManager(int, int, int, int, int, int, nvinfer1::DataType, std::shared_ptr<tensorrt_llm::runtime::CudaStream>, bool, bool) + 1752
4 0x7f9aa881a80f tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::KVCacheManager(int, int, int, int, int, int, int, int, int, int, bool, nvinfer1::DataType, std::shared_ptr<tensorrt_llm::runtime::CudaStream>, bool, bool, bool) + 175
5 0x7f9aa8848c54 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createKvCacheManager(tensorrt_llm::batch_manager::kv_cache_manager::KvCacheConfig const&) + 436
6 0x7f9aa884f517 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 4903
7 0x7f9aa880cd5a tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1930
8 0x7f9aa8804170 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::__cxx11::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
9 0x7f9b20188075 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, ompi_communicator_t*) + 4901
10 0x7f9b20189019 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 73
11 0x7f9b3001441c TRITONBACKEND_ModelInstanceInitialize + 828
12 0x7f9b37b24086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7f9b37b24086]
13 0x7f9b37b252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7f9b37b252c6]
14 0x7f9b37b078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7f9b37b078d5]
15 0x7f9b37b07f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7f9b37b07f16]
16 0x7f9b37b1480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7f9b37b1480d]
17 0x7f9b37176ee8 /lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f9b37176ee8]
18 0x7f9b37afe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7f9b37afe64b]
19 0x7f9b37b0f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7f9b37b0f4f5]
20 0x7f9b37b13c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7f9b37b13c2e]
21 0x7f9b37c08318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7f9b37c08318]
22 0x7f9b37c0bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7f9b37c0bbfc]
23 0x7f9b37d67a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7f9b37d67a02]
24 0x7f9b373e2253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9b373e2253]
25 0x7f9b37171ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9b37171ac3]
26 0x7f9b37202a04 clone + 68
E0521 23:22:50.095556 52252 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMalloc(ptr, n): out of memory
I0521 23:22:50.095787 52252 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0521 23:22:50.095821 52252 server.cc:634]
+-------------+-------------------------------------------------------+-------------------------------------------------------+
| Backend | Path | Config |
+-------------+-------------------------------------------------------+-------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-d |
| | | irectory":"/opt/tritonserver/backends","min-compute-c |
| | | apability":"6.000000","shm-region-prefix-name":"prefi |
| | | x0_","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tens | {"cmdline":{"auto-complete-config":"false","backend-d |
| | orrtllm.so | irectory":"/opt/tritonserver/backends","min-compute-c |
| | | apability":"6.000000","default-max-batch-size":"4"}} |
| | | |
+-------------+-------------------------------------------------------+-------------------------------------------------------+
I0521 23:22:50.095916 52252 server.cc:677]
+------------------+---------+------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+------------------+---------+------------------------------------------------------------------------------------------------+
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR |
| | | ] CUDA runtime error in ::cudaMalloc(ptr, n): out of memory (/tmp/tritonbuild/tensorrtllm/tens |
Triton Information
What version of Triton are you using? Container 24.04 with the TensorRT-LLM backend.
Are you using the Triton container or did you build it yourself? Using the Triton container.
To Reproduce
Steps to reproduce the behavior.
This is a particular finetune of Mistral 7B, but architecturally it is a vanilla Mistral 7B. It seems highly unlikely the model itself has anything to do with this, as it runs fine on one GPU.
python3 quantize.py --model_dir <> --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir <> --calib_size 32 --batch_size 16
trtllm-build --checkpoint_dir <> --gemm_plugin float16 --gpt_attention_plugin float16 --output_dir <> --paged_kv_cache enable --max_input_len 32256 --use_paged_context_fmha enable --context_fmha enable --max_batch_size 4
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
Attached is the config file. Most critically, here is the instance_group section:
instance_group [
{
count: 1
kind: KIND_GPU
}
]
Note the count: 1.
This may be of use: when I run it with
instance_group [
{
count: 1
kind : KIND_CPU
}
]
it will place one instance on GPU 0 and nothing on GPU 1. Both GPUs are visible during both launches:
I0521 23:21:27.576684 51741 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0521 23:21:27.576697 51741 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
Expected behavior
It should load one copy of the model onto GPU 0 and one onto GPU 1.
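For comparison only (a workaround sketch, not a confirmed fix): Triton's instance_group supports pinning instances to specific devices via the gpus field, so an explicit per-GPU layout would look like the block below. Whether the TensorRT-LLM backend respects Triton's device assignment here is exactly what is in question, so treat this as an assumption rather than a verified configuration.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]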
Below is the full config.pbtxt under /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/
config.pbtxt
:backend: "tensorrtllm"
max_batch_size: 4
model_transaction_policy {
decoupled: false
}
dynamic_batching {
preferred_batch_size: [ 4 ]
max_queue_delay_microseconds: 10000
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
allow_ragged_batch: true
},
{
name: "input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "draft_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "draft_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "embedding_bias"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_k"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "streaming"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
# the unique task ID for the given LoRA.
# To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
# The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
# If the cache is full the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached.
{
name: "lora_task_id"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
# weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
# where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
# each of the in / out tensors are first flattened and then concatenated together in the format above.
# D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
{
name: "lora_weights"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
# module identifier (same size a first dimension of lora_weights)
# See LoraModule::ModuleType for model id mapping
#
# "attn_qkv": 0 # compbined qkv adapter
# "attn_q": 1 # q adapter
# "attn_k": 2 # k adapter
# "attn_v": 3 # v adapter
# "attn_dense": 4 # adapter for the dense layer in attention
# "mlp_h_to_4h": 5 # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
# "mlp_4h_to_h": 6 # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
# "mlp_gate": 7 # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
#
# last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
{
name: "lora_config"
data_type: TYPE_INT32
dims: [ -1, 3 ]
optional: true
allow_ragged_batch: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "${max_beam_width}"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/opt/tritonserver/inflight_batcher_llm/tensorrt_llm/1/"
}
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: "${max_tokens_in_paged_kv_cache}"
}
}
parameters: {
key: "max_attention_window_size"
value: {
string_value: "${max_attention_window_size}"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "${batch_scheduler_policy}"
}
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: {
string_value: "${kv_cache_free_gpu_mem_fraction}"
}
}
parameters: {
key: "enable_trt_overlap"
value: {
string_value: "${enable_trt_overlap}"
}
}
parameters: {
key: "exclude_input_in_output"
value: {
string_value: "true"
}
}
parameters: {
key: "enable_kv_cache_reuse"
value: {
string_value: "true"
}
}
parameters: {
key: "normalize_log_probs"
value: {
string_value: "${normalize_log_probs}"
}
}
parameters: {
key: "enable_chunked_context"
value: {
string_value: "${enable_chunked_context}"
}
}
parameters: {
key: "gpu_device_ids"
value: {
string_value: "${gpu_device_ids}"
}
}
parameters: {
key: "lora_cache_optimal_adapter_size"
value: {
string_value: "${lora_cache_optimal_adapter_size}"
}
}
parameters: {
key: "lora_cache_max_adapter_size"
value: {
string_value: "${lora_cache_max_adapter_size}"
}
}
parameters: {
key: "lora_cache_gpu_memory_fraction"
value: {
string_value: "${lora_cache_gpu_memory_fraction}"
}
}
parameters: {
key: "lora_cache_host_memory_bytes"
value: {
string_value: "${lora_cache_host_memory_bytes}"
}
}
parameters: {
key: "decoding_mode"
value: {
string_value: "${decoding_mode}"
}
}
parameters: {
key: "worker_path"
value: {
string_value: "/opt/tritonserver/backends/tensorrtllm/triton_tensorrtllm_worker"
}
}
parameters: {
key: "medusa_choices"
value: {
string_value: "${medusa_choices}"
}
}
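For completeness, the gpu_device_ids parameter that the startup log warns about ("gpu_device_ids is not specified, will be automatically set") corresponds to the templated block above. A hedged example of filling it in to pin the engine to device 0 follows; the exact semantics for multiple Triton instances in this backend version are an assumption on my part, not something confirmed by the logs.
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0"
  }
}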