server
Triton backend: multiple instances of the same model run sequentially instead of in parallel on the same device with asynchronous requests
Description
We'd like to implement ensemble modeling to run inference with the following workflow:
We managed to run both the "construction" and "civil" models in parallel. However, each model instance took only about 1GB of GPU memory, so we increased the number of instances of each model to increase throughput. Although Triton does create multiple instances of the same model, they appear to run sequentially rather than in parallel, so throughput does not improve.
Triton Information
We're using NGC container 22.04.
System Information
AWS instance: g4dn.xlarge
To Reproduce
Here is an example of the results we obtained with the "construction" model, an EfficientDet-based object detection model with a static batch size of 10. The configuration file we're using:
platform: "tensorrt_plan"
max_batch_size: 0
input [
{
name: "stack_Concat__898:0"
data_type: TYPE_FP16
dims: [ 10,512,1280,3 ]
}
]
output [
{
name: "detection_boxes"
data_type: TYPE_FP16
dims: [ 10,100,4 ]
},
{
name: "detection_classes"
data_type: TYPE_INT32
dims: [10,100 ]
},
{
name: "detection_scores"
data_type: TYPE_FP16
dims: [10,100]
},
{
name: "num_detections"
data_type: TYPE_INT32
dims: [10,-1]
}
]
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
}
]
model_warmup [{
name : "image sample"
batch_size: 0
inputs {
key: "stack_Concat__898:0"
value: {
data_type: TYPE_FP16
dims: [10, 512, 1280, 3 ]
random_data: true
}
}
}]
Command used to run the docker container :
docker run --rm -it --gpus all --name triton --ipc=host -p8000:8000 -p8001:8001 -p8002:8002 --shm-size=8gb --ulimit memlock=-1 --ulimit stack=568435456 --user root:root -v $(pwd):/server -t triton
Command used to run triton server :
tritonserver --model-repository=triton-models --model-control-mode=explicit --load-model construction --cuda-memory-pool-byte-size 0:268435456 --pinned-memory-pool-byte-size 548843545 --http-thread-count 10
Output with instance count = 1 in construction model config file :
root@50208ced5836:/server# tritonserver --model-repository=triton-models --model-control-mode=explicit --load-model construction --cuda-memory-pool-byte-size 0:268435456 --pinned-memory-pool-byte-size 548843545 --http-thread-count 10
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
I0517 12:35:10.705208 574 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
I0517 12:35:10.705272 574 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
I0517 12:35:10.705283 574 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
2022-05-17 12:35:10.887381: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0517 12:35:10.928375 574 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
I0517 12:35:10.928427 574 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
I0517 12:35:10.928436 574 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
I0517 12:35:10.928444 574 tensorflow.cc:2221] backend configuration:
{}
I0517 12:35:10.930059 574 onnxruntime.cc:2400] TRITONBACKEND_Initialize: onnxruntime
I0517 12:35:10.930085 574 onnxruntime.cc:2410] Triton TRITONBACKEND API version: 1.9
I0517 12:35:10.930095 574 onnxruntime.cc:2416] 'onnxruntime' TRITONBACKEND API version: 1.9
I0517 12:35:10.930102 574 onnxruntime.cc:2446] backend configuration:
{}
I0517 12:35:10.948213 574 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0517 12:35:10.948244 574 openvino.cc:1217] Triton TRITONBACKEND API version: 1.9
I0517 12:35:10.948254 574 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.9
I0517 12:35:12.997021 574 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7efb62000000' with size 548843545
I0517 12:35:12.997448 574 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 268435456
I0517 12:35:12.999575 574 model_repository_manager.cc:1077] loading: construction:1
I0517 12:35:13.101033 574 tensorrt.cc:5294] TRITONBACKEND_Initialize: tensorrt
I0517 12:35:13.101069 574 tensorrt.cc:5304] Triton TRITONBACKEND API version: 1.9
I0517 12:35:13.101079 574 tensorrt.cc:5310] 'tensorrt' TRITONBACKEND API version: 1.9
I0517 12:35:13.101155 574 tensorrt.cc:5353] backend configuration:
{}
I0517 12:35:13.101182 574 tensorrt.cc:5405] TRITONBACKEND_ModelInitialize: construction (version 1)
I0517 12:35:13.102603 574 tensorrt.cc:5454] TRITONBACKEND_ModelInstanceInitialize: construction_0 (GPU device 0)
I0517 12:35:13.625109 574 logging.cc:49] [MemUsageChange] Init CUDA: CPU +471, GPU +0, now: CPU 2156, GPU 958 (MiB)
I0517 12:35:13.758068 574 logging.cc:49] Loaded engine size: 89 MiB
I0517 12:35:14.731365 574 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +786, GPU +224, now: CPU 3156, GPU 1272 (MiB)
I0517 12:35:14.930925 574 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +162, GPU +52, now: CPU 3318, GPU 1324 (MiB)
I0517 12:35:14.933229 574 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +83, now: CPU 0, GPU 83 (MiB)
I0517 12:35:14.939696 574 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 3140, GPU 1316 (MiB)
I0517 12:35:14.941363 574 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3140, GPU 1324 (MiB)
I0517 12:35:14.982711 574 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +856, now: CPU 1, GPU 939 (MiB)
I0517 12:35:14.983374 574 tensorrt.cc:1411] Created instance construction_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0517 12:35:14.984087 574 model_repository_manager.cc:1231] successfully loaded 'construction' version 1
I0517 12:35:14.984262 574 server.cc:549]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0517 12:35:14.984423 574 server.cc:576]
+-------------+-------------------------------------------------------------------------+--------+
| Backend | Path | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
| openvino | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {} |
| tensorrt | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so | {} |
+-------------+-------------------------------------------------------------------------+--------+
I0517 12:35:14.984474 574 server.cc:619]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| construction | 1 | READY |
+--------------+---------+--------+
I0517 12:35:14.996284 574 metrics.cc:650] Collecting metrics for GPU 0: Tesla T4
I0517 12:35:14.996586 574 tritonserver.cc:2123]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.21.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | triton-models |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | construction |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 548843545 |
| cuda_memory_pool_byte_size{0} | 268435456 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0517 12:35:14.999922 574 grpc_server.cc:4544] Started GRPCInferenceService at 0.0.0.0:8001
I0517 12:35:15.000238 574 http_server.cc:3242] Started HTTPService at 0.0.0.0:8000
I0517 12:35:15.067631 574 http_server.cc:180] Started Metrics Service at 0.0.0.0:8002
Output with instance count = 2 in construction model config file :
root@50208ced5836:/server# tritonserver --model-repository=triton-models --model-control-mode=explicit --load-model construction --cuda-memory-pool-byte-size 0:268435456 --pinned-memory-pool-byte-size 548843545 --http-thread-count 10
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
I0517 12:55:04.150003 776 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
I0517 12:55:04.150060 776 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
I0517 12:55:04.150067 776 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
2022-05-17 12:55:04.331099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0517 12:55:04.371862 776 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
I0517 12:55:04.371894 776 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
I0517 12:55:04.371901 776 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
I0517 12:55:04.371906 776 tensorflow.cc:2221] backend configuration:
{}
I0517 12:55:04.373518 776 onnxruntime.cc:2400] TRITONBACKEND_Initialize: onnxruntime
I0517 12:55:04.373542 776 onnxruntime.cc:2410] Triton TRITONBACKEND API version: 1.9
I0517 12:55:04.373549 776 onnxruntime.cc:2416] 'onnxruntime' TRITONBACKEND API version: 1.9
I0517 12:55:04.373554 776 onnxruntime.cc:2446] backend configuration:
{}
I0517 12:55:04.390901 776 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0517 12:55:04.390923 776 openvino.cc:1217] Triton TRITONBACKEND API version: 1.9
I0517 12:55:04.390929 776 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.9
I0517 12:55:06.398005 776 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f5556000000' with size 548843545
I0517 12:55:06.398473 776 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 268435456
I0517 12:55:06.400672 776 model_repository_manager.cc:1077] loading: construction:1
I0517 12:55:06.502072 776 tensorrt.cc:5294] TRITONBACKEND_Initialize: tensorrt
I0517 12:55:06.502103 776 tensorrt.cc:5304] Triton TRITONBACKEND API version: 1.9
I0517 12:55:06.502114 776 tensorrt.cc:5310] 'tensorrt' TRITONBACKEND API version: 1.9
I0517 12:55:06.502196 776 tensorrt.cc:5353] backend configuration:
{}
I0517 12:55:06.502227 776 tensorrt.cc:5405] TRITONBACKEND_ModelInitialize: construction (version 1)
I0517 12:55:06.503616 776 tensorrt.cc:5454] TRITONBACKEND_ModelInstanceInitialize: construction_0_0 (GPU device 0)
I0517 12:55:07.031461 776 logging.cc:49] [MemUsageChange] Init CUDA: CPU +471, GPU +0, now: CPU 2156, GPU 958 (MiB)
I0517 12:55:07.165348 776 logging.cc:49] Loaded engine size: 89 MiB
I0517 12:55:08.127716 776 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +786, GPU +224, now: CPU 3156, GPU 1272 (MiB)
I0517 12:55:08.318249 776 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +162, GPU +52, now: CPU 3318, GPU 1324 (MiB)
I0517 12:55:08.320482 776 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +83, now: CPU 0, GPU 83 (MiB)
I0517 12:55:08.326906 776 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 3140, GPU 1316 (MiB)
I0517 12:55:08.328530 776 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3140, GPU 1324 (MiB)
I0517 12:55:08.373881 776 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +856, now: CPU 1, GPU 939 (MiB)
I0517 12:55:08.374452 776 tensorrt.cc:1411] Created instance construction_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0517 12:55:08.375143 776 tensorrt.cc:5454] TRITONBACKEND_ModelInstanceInitialize: construction_0_1 (GPU device 0)
I0517 12:55:08.376915 776 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3149, GPU 2230 (MiB)
I0517 12:55:08.377903 776 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 3149, GPU 2240 (MiB)
I0517 12:55:08.386167 776 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +857, now: CPU 1, GPU 1796 (MiB)
I0517 12:55:08.386634 776 tensorrt.cc:1411] Created instance construction_0_1 on GPU 0 with stream priority 0 and optimization profile default[0];
I0517 12:55:08.386809 776 model_repository_manager.cc:1231] successfully loaded 'construction' version 1
I0517 12:55:08.387147 776 server.cc:549]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0517 12:55:08.387221 776 server.cc:576]
+-------------+-------------------------------------------------------------------------+--------+
| Backend | Path | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
| openvino | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {} |
| tensorrt | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so | {} |
+-------------+-------------------------------------------------------------------------+--------+
I0517 12:55:08.387266 776 server.cc:619]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| construction | 1 | READY |
+--------------+---------+--------+
I0517 12:55:08.399238 776 metrics.cc:650] Collecting metrics for GPU 0: Tesla T4
I0517 12:55:08.399578 776 tritonserver.cc:2123]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.21.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | triton-models |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | construction |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 548843545 |
| cuda_memory_pool_byte_size{0} | 268435456 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0517 12:55:08.403199 776 grpc_server.cc:4544] Started GRPCInferenceService at 0.0.0.0:8001
I0517 12:55:08.403457 776 http_server.cc:3242] Started HTTPService at 0.0.0.0:8000
I0517 12:55:08.467533 776 http_server.cc:180] Started Metrics Service at 0.0.0.0:8002
Steps to reproduce the behavior.
Script to send inference requests :
import numpy as np
import tritonclient.http as httpclient
import logging
import time
logging.basicConfig(level=logging.INFO)
logging.getLogger("TritonClient").setLevel(logging.INFO)
log = logging.getLogger("TritonClient")
number_of_requests=10
requests_concurrency=10
triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=False, concurrency=requests_concurrency )
model_name = "construction"
model_metadata = triton_client.get_model_metadata(model_name)
input_names = [str(i["name"]) for i in model_metadata["inputs"]]
output_names = [str(i["name"]) for i in model_metadata["outputs"]]
async_requests=[]
final_results =[]
# Loading dummy input for the model
image_data = np.ones([10,512,1280,3], np.float16)
inputs=[httpclient.InferInput(input_names[0], image_data.shape, "FP16")]
outputs=[httpclient.InferRequestedOutput(output) for output in output_names ]
inputs[0].set_data_from_numpy(image_data)
begining=time.time()
for j in range(number_of_requests):
    start=time.time()
    async_requests=[]
    # Send one round of concurrent requests
    for i in range(requests_concurrency) :
        async_requests.append(triton_client.async_infer(model_name=model_name,
                                                        inputs=inputs,
                                                        outputs=outputs))
    # Block until every request in this round has completed
    for async_request in async_requests:
        final_results.append(async_request.get_result())
    log.info("Time taken to run {} asynchronous inference requests is {} ".format(requests_concurrency, time.time()-start))
log.info("Total time to run {} inference requests is {}".format(requests_concurrency*number_of_requests, time.time()-begining))
Output with instance count = 1 in construction model config file :
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.992401599884033
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.018524408340454
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.034783124923706
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.047888994216919
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0408923625946045
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.076260805130005
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.1126108169555664
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.134840965270996
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.1259348392486572
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.1820130348205566
INFO:TritonClient:Total time to run 100 inference requests is 31.767416954040527
Output with instance count = 2 in construction model config file :
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.9438741207122803
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.020892381668091
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.042755603790283
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0679333209991455
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0309293270111084
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0512776374816895
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.068094491958618
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0799334049224854
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.097849130630493
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.078545570373535
INFO:TritonClient:Total time to run 100 inference requests is 31.483280658721924
Expected behavior
We expected throughput to double when increasing the model's instance count to 2, since the GPU did not appear to be saturated. However, throughput did not change.
| Number of model instances | GPU Utilization (memory used / total) | Average time to run 10 asynchronous inference requests |
|---|---|---|
| 1 | 2440MiB / 15109MiB | 3.148s |
| 2 | 3356MiB / 15109MiB | 3.176s |
| 4 | 5188MiB / 15109MiB | 3.182s |
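To make the comparison concrete, the total times from the client logs above can be turned into throughput figures. This is a small sketch using only the numbers already reported in this issue; the helper name `throughput` is made up for illustration.

```python
# Figures copied from the client logs above.
BATCH_SIZE = 10        # images per request (static batch in the engine)
TOTAL_REQUESTS = 100   # 10 rounds x 10 concurrent requests

total_seconds = {
    1: 31.767416954040527,  # instance count = 1
    2: 31.483280658721924,  # instance count = 2
}


def throughput(total_s, requests=TOTAL_REQUESTS, batch=BATCH_SIZE):
    """Return (requests per second, images per second)."""
    rps = requests / total_s
    return rps, rps * batch


results = {n: throughput(t) for n, t in total_seconds.items()}
for n, (rps, ips) in results.items():
    print(f"instances={n}: {rps:.2f} req/s, {ips:.1f} images/s")

# Ratio of requests/s with 2 instances to requests/s with 1 instance.
speedup = results[2][0] / results[1][0]
print(f"speedup from 1 to 2 instances: {speedup:.3f}x")
```

The ratio comes out at roughly 1%, far from the 2x we were hoping for.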
Hi @Abdellah-Laassairi, I think you need to specify the instance_group field in each composing model so that each one supports parallel execution individually. Please refer to the Ensemble Model Instance Groups section.
Hello! We did specify the instance_group field for each composing model, but that didn't change anything. In the example I've provided, I only loaded one model, and changing that model's instance_group count changed neither latency nor throughput.
Any updates?
Hi, could you monitor the GPU utilization while running both cases to see if the GPU is already fully utilized with the first instance?
Hello! GPU utilization was at 2.4GB while running case 1 (1 model instance) and at 3.3GB in case 2 (2 model instances), the available memory for a g4dn instance is about 16GB, so it's far from being fully utilized.
GPU utilization refers to whether the GPU is computing at full capacity. If you use nvidia-smi to monitor GPU status, you will notice that GPU utilization (the 2% field below) is a different field from GPU memory usage:
$ nvidia-smi
Mon May 23 10:16:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN V On | 00000000:65:00.0 On | N/A |
| 29% 43C P2 32W / 250W | 680MiB / 12288MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
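If it helps, the two fields can also be read separately in a script. This is a rough sketch, assuming `nvidia-smi` is on the PATH and supports the `--query-gpu` and `--format=csv` options; the function names `parse_gpu_line` and `query_gpus` are made up for this example, and the parser does not handle fields that the driver reports as not available.

```python
import subprocess

# Ask nvidia-smi for compute utilization and memory as plain CSV numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as '2, 680, 12288' into a dict."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Values taken from the nvidia-smi output above: 2% utilization, 680MiB of 12288MiB.
sample = parse_gpu_line("2, 680, 12288")
print(sample)
```

Calling `query_gpus()` in a loop while the client script is running shows whether `util_pct` is already near 100 with a single instance, independently of how much memory is in use.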
Hello, thanks for your answer, here's the output for both cases :
instance count = 1
Tue May 24 08:23:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 48C P0 35W / 70W | 2240MiB / 15109MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
instance count = 2
Tue May 24 08:28:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 47C P0 69W / 70W | 3356MiB / 15109MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
It looks like your model is complex enough (or the GPU has too few compute resources) that a single model instance already fully loads the GPU's compute resources. On a bigger GPU with enough compute resources, two or more instances could run in parallel, but in your case they have to be executed sequentially. Note that in addition to nvidia-smi, you can use the NVIDIA Nsight Systems tool (nsys) to get fine-grained information on how the two model inferences load your GPU.
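The reasoning can be illustrated with a deliberately idealized model. It assumes perfect time-sharing of the GPU, no scheduling overhead, and a solo inference time of about 0.3 s (roughly 3 s for 10 requests, as in the logs above); the function `round_time` and the `gpu_share_needed` parameter are invented for this sketch and do not correspond to anything in Triton.

```python
import math


def round_time(n_requests, n_instances, solo_time_s, gpu_share_needed):
    """Idealized time to finish n_requests.

    gpu_share_needed is the fraction of GPU compute (0 to 1] that one
    inference uses when it runs alone.
    """
    concurrent = min(n_instances, n_requests)
    demand = concurrent * gpu_share_needed
    # If the instances together ask for more than the whole GPU,
    # each inference is stretched proportionally.
    slowdown = max(1.0, demand)
    waves = math.ceil(n_requests / concurrent)
    return waves * solo_time_s * slowdown


for share in (1.0, 0.4):
    for instances in (1, 2):
        t = round_time(10, instances, 0.3, share)
        print(f"share={share}, instances={instances}: {t:.2f} s")
```

When one inference needs the whole GPU, adding a second instance leaves the time per round unchanged in this model; the time only halves when a single inference needs less than half of the GPU.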
Hello! I was already using nsys to trace CUDA and cuDNN. The timeline output with instance count = 1 is as I expect it to be. However, it's a bit difficult to understand how the two instances of the same model behave in nsys, since both of them are executed on the same thread. In the meantime, I'll try a simpler model to verify whether GPU compute resources are the bottleneck in this case.
Closing issue due to lack of activity. Please re-open the issue if you would like to follow up on it.
I am having the same issue, but with models running on CPU. Instead of running in parallel, the models seem to execute in sequence.
@Abdellah-Laassairi @marchaarsma You can refer to this code: auto execution_policy = TRITONBACKEND_EXECUTION_DEVICE_BLOCKING; My understanding is that the multi-instance TensorRT backend is set to blocking mode on a single device, and that parallelism can be achieved across different devices.