How do I set the parameters to enable concurrent model execution?
Description
I noticed the "Concurrent Model Execution" section in the documentation.
Triton should be able to execute a model in parallel when instance_group is adjusted.
After setting instance_group to 4, I did not see any parallel execution; I only noticed that the number of CUDA streams increased.
Are there any other parameters I need to adjust? Could you give me some suggestions?
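For reference, a minimal config.pbtxt along these lines is what I mean by setting instance_group to 4 (the model name and GPU index are just illustrative for my setup):
name: "tensorrt_fp16_model"
platform: "tensorrt_plan"
# Ask Triton for four execution instances of this model on GPU 0.
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]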
The picture below is the result of sending two requests (to the same model) at the same time and observing with Nsight.
The commands are as follows:
Triton server
./tritonserver --model-repository=../docs/examples/model_repository/
Client
./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http & \
./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http
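(For what it's worth, I believe a single perf_analyzer process can also drive a fixed number of concurrent requests with the --concurrency-range option instead of launching two clients, for example:
./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http --concurrency-range 4
I used two separate processes here simply to make the two simultaneous requests explicit.)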
Information
Nsight version: 2024.2.2.28-242234212449v0 (Linux)
Hardware: NVIDIA Jetson AGX Orin, JetPack 6.0
Triton server version: 2.48.0
Hello Will, I reported a similar issue: #7706. I'm wondering if you have found any clues toward a solution? Thanks!
Hi @Will-Chou-5722,
I think your observations are correct. The TensorRT backend is unique in that it uses a single thread for multiple model instances on the same GPU, whereas most other backends use one thread per model instance. You can read more details about this TensorRT backend behavior in these two comments:
- https://github.com/triton-inference-server/server/issues/4319#issuecomment-1118928192
- https://github.com/triton-inference-server/server/issues/4319#issuecomment-1302830304
Hi @rmccorm4 Thank you for the information. It's very helpful.
Hello @rmccorm4 ,
I have been looking into the implementation of model instances in core/backend_model_instance.cc and noticed that Triton seems to spawn a separate thread for each model instance, as you mentioned above. While this architecture is effective for isolating and managing instances, I have a few questions regarding thread management and fault handling:
Thread Tracking:
Is there an existing mechanism within Triton to trace the lifecycle and flow of the threads associated with model instances? Specifically, is there a way to monitor whether a thread executing a model instance is terminated or killed unexpectedly?
Fault Handling:
In scenarios where a model instance's thread is unexpectedly terminated, does Triton have a built-in recovery mechanism to handle such failures gracefully, or is this situation expected to be managed entirely at the backend level?
I’d appreciate your guidance on whether this is something that Triton handles natively, or if thread-level monitoring and recovery should be implemented by backend developers.
Thank you for your assistance!
@rmccorm4, @dyastremsky, could you help me with the above understanding? Any guidance would be greatly appreciated.