Graceful Shutdown Condition Enhancement for NVIDIA Triton Server
Issue Description: NVIDIA Triton Server currently implements a graceful shutdown mechanism that is triggered only when there are no inflight inferences. However, it does not consider ongoing HTTP connections, which can lead to issues where the server shuts down models prematurely, causing unexpected behavior for ongoing requests.
Issue Summary: The current graceful shutdown logic in Triton Server should take into account both inflight inferences AND ongoing HTTP connections to ensure a truly graceful and reliable shutdown process.
Expected Behavior:
- Graceful shutdown should be conditional on:
  - No inflight inferences for any models.
  - No ongoing HTTP connections.
Actual Behavior: Currently, Triton Server may initiate a shutdown even if there are ongoing HTTP connections, which can result in unexpected behavior for clients.
Steps to Reproduce:
- Start Triton Server with multiple models.
- Send an HTTP request to one of the models.
- Trigger a graceful shutdown.
- Observe that the ongoing HTTP connection may still execute a model that is not ready.
Proposed Solution: To enhance the graceful shutdown behavior, we should modify the shutdown logic to wait until both of the following conditions are met:
- There are no inflight inferences for any models.
- There are no ongoing HTTP connections.
This change will ensure that Triton Server shuts down models only when it is safe to do so, preventing the issue where an ongoing HTTP connection attempts to execute an unready model during the shutdown process.
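The two-condition wait described above can be sketched generically. This is only an illustrative sketch of the idea, not Triton's actual implementation; names like `ShutdownGate` are hypothetical:

```python
import threading


class ShutdownGate:
    """Illustrative sketch: track inflight inferences and open HTTP
    connections so a graceful shutdown can block until BOTH counts
    reach zero before unloading models. Not Triton's real internals."""

    def __init__(self):
        self._cond = threading.Condition()
        self._inflight_inferences = 0
        self._open_connections = 0

    def inference_started(self):
        with self._cond:
            self._inflight_inferences += 1

    def inference_finished(self):
        with self._cond:
            self._inflight_inferences -= 1
            self._cond.notify_all()

    def connection_opened(self):
        with self._cond:
            self._open_connections += 1

    def connection_closed(self):
        with self._cond:
            self._open_connections -= 1
            self._cond.notify_all()

    def wait_until_idle(self, timeout=None):
        """Block until there are no inflight inferences AND no ongoing
        HTTP connections; only then is it safe to unload models."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._inflight_inferences == 0
                and self._open_connections == 0,
                timeout=timeout,
            )
```

On shutdown, the server would call `wait_until_idle()` before starting model unload; the key point is that a lone open connection (even with no inference currently running) is enough to delay unloading.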
@jsoto-gladia , Thank you for your report! I'll file a ticket for our team to incorporate this suggestion
Also having this issue, and would prefer the signal handling to live in triton-server itself rather than relying on an external process handler/monitor for these signals.
Hi @jsoto-gladia, can you provide a minimum reproduction with the client script and model? We would like to see the details on how the issue is encountered, so we can make sure the proposed solution can be implemented correctly.
I'm having a similar problem. In situations where there are pending requests in the queue, pending requests are not considered in-flight during a graceful shutdown. Consequently, these pending requests fail. From the client's side, it seems reasonable to expect the server to process these requests during the graceful shutdown period, given that the server has already accepted them.
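The expectation described here can be sketched as a drain step: once shutdown begins, stop accepting new work but process every request the server has already accepted into its queue. This is a generic sketch with hypothetical names (`drain_and_shutdown`, `handle_request`), not Triton's scheduler:

```python
import queue


def drain_and_shutdown(request_queue, handle_request):
    """Sketch: treat queued (pending) requests as in-flight work.
    After new requests stop being accepted elsewhere, process every
    request already in the queue instead of failing them, then return
    so models can be safely unloaded."""
    results = []
    while True:
        try:
            req = request_queue.get_nowait()
        except queue.Empty:
            break  # queue fully drained; safe to proceed with shutdown
        results.append(handle_request(req))
    return results
```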
The current graceful shutdown mechanism makes seamless rolling updates difficult in a Kubernetes environment; working around it requires additional, inefficient measures such as pre-stop hooks.
Hi, we have enhanced the logic to wait for all ongoing HTTP connections to complete before starting to unload models during shutdown, and new connections are refused after shutdown has been initiated. https://github.com/triton-inference-server/server/pull/6986
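The behavior described in the PR (refuse new connections once shutdown starts, let existing ones finish) can be illustrated roughly as follows. This is a generic sketch with a hypothetical `ConnectionTracker` class, not the actual change in the PR:

```python
import threading


class ConnectionTracker:
    """Sketch: once shutdown begins, reject new connections while
    connections already open are allowed to run to completion.
    Illustrative only; not Triton's implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._shutting_down = False
        self._open = 0

    def try_open(self):
        """Return True if a new connection may be accepted."""
        with self._lock:
            if self._shutting_down:
                return False  # refuse new connections after shutdown starts
            self._open += 1
            return True

    def close(self):
        with self._lock:
            self._open -= 1

    def begin_shutdown(self):
        with self._lock:
            self._shutting_down = True

    def idle(self):
        """True when no connections remain; models may then be unloaded."""
        with self._lock:
            return self._open == 0
```

A similar counter around gRPC streams (e.g. incremented when a call starts and decremented when it completes) is one plausible way to extend this to gRPC, as asked in the follow-up comment, though the source does not describe how that would be implemented.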
@kthui Is there any news about this? Need gRPC support. Can you tell me how to track gRPC connections? Thanks.