`pb_utils.TritonError.NOT_HEALTHY` Error Code
Is your feature request related to a problem? Please describe.
The problem arises when handling a model's health issues (e.g. insufficient CPU RAM). The currently available error codes (such as `pb_utils.TritonError.UNKNOWN`, `pb_utils.TritonError.INTERNAL`, etc.) do not specifically cover this case. When the problem occurs, the client receives a `tritonclient.utils.InferenceServerException` such as `[StatusCode.INTERNAL] in ensemble 'some_ensemble', Failed to process the request(s) for model instance 'some_model_0_0', message: Stub process 'some_model_0_0' is not healthy`. This lack of specificity in error codes makes it challenging to implement efficient error handling, particularly when NVIDIA Triton Inference Server is used with external systems that have auto-retry handling enabled (say, Celery).
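To illustrate the workaround this currently forces, here is a minimal sketch that decides whether a failure is a stub-health problem by matching on the message text. The helper name `is_stub_unhealthy` and the constant `NOT_HEALTHY_MARKER` are hypothetical; the marker string is simply taken from the message quoted above and could change between Triton versions, which is exactly why this approach is fragile.

```python
# Substring taken from the error message quoted in this issue.
# Assumption: the wording is stable enough to match on, which is not guaranteed.
NOT_HEALTHY_MARKER = "is not healthy"


def is_stub_unhealthy(message: str) -> bool:
    """Return True if the error text looks like a stub-health failure."""
    return NOT_HEALTHY_MARKER in message


example = (
    "[StatusCode.INTERNAL] in ensemble 'some_ensemble', "
    "Failed to process the request(s) for model instance 'some_model_0_0', "
    "message: Stub process 'some_model_0_0' is not healthy"
)

# The status code alone (INTERNAL) cannot distinguish these two cases,
# so the decision rests entirely on the message text.
unhealthy = is_stub_unhealthy(example)
other = is_stub_unhealthy("[StatusCode.INTERNAL] some other failure")
```

Because both messages carry the same `INTERNAL` status, a retry policy built on this check depends on free-form text rather than on a stable code.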
Describe the solution you'd like
I propose introducing a new error code: `pb_utils.TritonError.NOT_HEALTHY`. This error code would specifically indicate issues related to the health of a model, such as CPU RAM problems. With it, I could implement more targeted error handling strategies, such as auto-retrying requests to NVIDIA Triton, knowing that the stub will be reinitialized subsequently. Alternatively, the error code `pb_utils.TritonError.UNAVAILABLE` could be raised specifically for model health issues.
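As a rough sketch of the client-side handling a dedicated code would enable, the snippet below retries only when the error carries a specific status. Everything here is illustrative: `FakeInferenceError` is a stand-in for `tritonclient.utils.InferenceServerException` so the example runs without Triton, `call_with_retry` is a hypothetical helper, and `RETRYABLE_STATUS` assumes the health problem would surface to the client as `StatusCode.UNAVAILABLE`, which nothing in this thread confirms.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: the status string a client would see if model health problems
# were reported with a dedicated code. Not confirmed behaviour.
RETRYABLE_STATUS = "StatusCode.UNAVAILABLE"


class FakeInferenceError(Exception):
    """Stand-in for the real client exception, exposing only status()."""

    def __init__(self, msg: str, status: str) -> None:
        super().__init__(msg)
        self._status = status

    def status(self) -> str:
        return self._status


def call_with_retry(fn: Callable[[], T], max_attempts: int = 3,
                    delay_s: float = 0.0) -> T:
    """Call fn, retrying only on the retryable status, up to max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except FakeInferenceError as err:
            # Re-raise anything that is not a health issue, or the final failure.
            if err.status() != RETRYABLE_STATUS or attempt == max_attempts:
                raise
            time.sleep(delay_s)  # give the stub time to be reinitialized
    raise AssertionError("unreachable")
```

With a check like this, errors that carry any other status fail immediately instead of being retried, which is the reduction in false positives the request is after.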
Describe alternatives you've considered
The current alternative is to use the existing, more generalized error codes. However, this approach lacks precision and may lead to unnecessary auto-retries for various issues, resulting in a high rate of false positives.
@khaykingleb Thanks for your feature request. I think this is a reasonable request. @krishung5 Do you have any thoughts regarding this request?
I think that this request will help users to better handle the model health issues. Filed a feature request ticket (DLIS-6039).
Hello.
Any news on this feature? I'm encountering issues in production with Kubernetes + Triton server, and the "not healthy" log isn't really helping me; I don't know what's wrong.
Thanks :)