
How to set the parameters to enable concurrent model execution?

Open Will-Chou-5722 opened this issue 1 year ago • 1 comment

Description: I noticed the "Concurrent Model Execution" section in the documentation, which says Triton can execute a model in parallel by adjusting instance_group.

After adjusting instance_group to 4, I didn't see parallel execution; I only noticed that the number of CUDA streams increased. Are there any other parameters that need to be adjusted? Could you give me some suggestions? The picture below is the result of sending two requests (same model) at the same time, observed with Nsight. (Screenshot 2024-08-30 143240)
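For reference, a minimal config.pbtxt sketch of the instance_group setting described above; the platform and max_batch_size values here are illustrative assumptions, and only the instance_group count is the relevant part:

  # Illustrative config.pbtxt; only instance_group matters for this issue.
  name: "tensorrt_fp16_model"
  platform: "tensorrt_plan"      # assumed TensorRT plan model
  max_batch_size: 8              # assumed value
  instance_group [
    {
      count: 4        # four instances of this model
      kind: KIND_GPU
      gpus: [ 0 ]
    }
  ]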

The commands are as follows:

Triton server:
  ./tritonserver --model-repository=../docs/examples/model_repository/

Client (two perf_analyzer processes launched at the same time):
  ./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http & \
  ./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http
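As an aside, concurrent load can also be generated from a single client process; a sketch of that, assuming the same model and perf_analyzer's --concurrency-range option:

  # Keep 4 requests in flight at once from one perf_analyzer process
  ./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http \
      --concurrency-range 4:4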

Information:
  • Nsight Systems version: 2024.2.2.28-242234212449v0
  • OS: Linux
  • Hardware: NVIDIA Jetson AGX Orin, JetPack 6.0
  • Triton server version: 2.48.0

Will-Chou-5722 avatar Aug 30 '24 06:08 Will-Chou-5722

Hello Will, I reported a similar issue: #7706 I'm wondering if you got any clue on the solution? Thanks!

lei1liu avatar Oct 15 '24 22:10 lei1liu

Hi @Will-Chou-5722,

I think your observations look correct. The TensorRT backend is unique in that it uses one thread for multiple model instances on the same GPU, whereas most other backends have one thread per model instance. You can read more details about this TensorRT backend behavior here (two different comments):

  • https://github.com/triton-inference-server/server/issues/4319#issuecomment-1118928192
  • https://github.com/triton-inference-server/server/issues/4319#issuecomment-1302830304
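Given that behavior, concurrency for TensorRT instances shows up as kernels overlapping on different CUDA streams rather than as additional threads. One possible way to capture this with Nsight Systems (the output file name is arbitrary, and this invocation is just a sketch):

  nsys profile --trace=cuda,nvtx -o triton_concurrency \
      ./tritonserver --model-repository=../docs/examples/model_repository/

In the resulting timeline, look for inference kernels from the two requests running in the same time window on different streams.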

rmccorm4 avatar Nov 15 '24 08:11 rmccorm4

Hi @rmccorm4 Thank you for the information. It's very helpful.

Will-Chou-5722 avatar Nov 22 '24 09:11 Will-Chou-5722

Hello @rmccorm4 ,

I have been looking into the implementation of model instances in core/backend_model_instance.cc and noticed that Triton seems to spawn a separate thread for each model instance, as you mentioned above. While this architecture is effective for isolating and managing instances, I have a few questions about thread management and fault handling:

Thread Tracking:

Is there an existing mechanism within Triton to trace the lifecycle and flow of the threads associated with model instances? Specifically, is there a way to detect when a thread executing a model instance is terminated or killed unexpectedly?

Fault Handling:

In scenarios where a model instance's thread is unexpectedly terminated, does Triton have a built-in recovery mechanism to handle such failures gracefully? Or is this situation expected to be managed entirely at the backend level?

I’d appreciate your guidance on whether this is something that Triton handles natively, or if thread-level monitoring and recovery should be implemented by backend developers.
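For context on what I mean by backend-level monitoring, here is a purely illustrative heartbeat/watchdog sketch of the kind a backend developer might add. This is not Triton code, just a generic pattern:

  // Purely illustrative sketch -- NOT Triton's implementation. Each "instance"
  // worker thread records a heartbeat; a monitor thread reports workers whose
  // heartbeat goes stale (e.g. the thread died or is stuck).
  #include <atomic>
  #include <chrono>
  #include <cstdint>
  #include <iostream>
  #include <thread>
  #include <vector>

  using Clock = std::chrono::steady_clock;

  static int64_t NowMs() {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               Clock::now().time_since_epoch()).count();
  }

  struct InstanceWorker {
    std::atomic<int64_t> last_heartbeat_ms{0};
    std::thread thread;
  };

  int main() {
    std::atomic<bool> stop{false};
    std::vector<InstanceWorker> workers(2);

    // One worker thread per "model instance": do work, update heartbeat.
    for (auto& w : workers) {
      w.thread = std::thread([&w, &stop] {
        while (!stop) {
          w.last_heartbeat_ms = NowMs();
          std::this_thread::sleep_for(std::chrono::milliseconds(100));  // "work"
        }
      });
    }

    // Monitor thread: flag any worker whose heartbeat is older than 2 seconds,
    // so the backend (or an operator) can decide how to recover.
    std::thread monitor([&workers, &stop] {
      while (!stop) {
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        for (size_t i = 0; i < workers.size(); ++i) {
          if (NowMs() - workers[i].last_heartbeat_ms > 2000) {
            std::cerr << "instance " << i << " appears stalled or dead\n";
          }
        }
      }
    });

    std::this_thread::sleep_for(std::chrono::seconds(3));
    stop = true;
    for (auto& w : workers) w.thread.join();
    monitor.join();
    return 0;
  }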

Thank you for your assistance!

Hemaprasannakc avatar Dec 05 '24 15:12 Hemaprasannakc

@rmccorm4, @dyastremsky Could you help me with the above understanding? Any guidance would be greatly appreciated.

Hemaprasannakc avatar Jan 08 '25 04:01 Hemaprasannakc