gax-nodejs
Allow deciding whether to retry based on the error message
Is your feature request related to a problem? Please describe.
Getting an error like:
{"name":"Error","message":"13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET","stack":"at Object.callErrorFromStatus (/usr/local/autotrader/app/node_modules/@grpc/grpc-js/build/src/call.js:31:26)"}
I want to be able to retry such errors, but not all INTERNAL errors.

Describe the solution you'd like
Allow specifying a custom shouldRetryFn function that could return true if the error message contains a certain phrase. This function would be part of the RetryOptions that you can pass as an override to createApiCall; see the sketch after the list of alternatives below.

Describe alternatives you've considered
- It is currently possible to retry all errors with INTERNAL code.
- Another solution would be adding a separate error code for connection errors to the grpc library, which could break existing solutions for others relying on the INTERNAL code.
- In streaming calls it is already possible to specify a similar shouldRetryFn function. Normal calls should also have such an option.
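For illustration, something like this (hypothetical: gax does not offer such an option for normal calls today, and the names below only illustrate the request):

// A predicate that inspects the error message in addition to the status code.
const shouldRetryFn = error =>
  error.code === 13 /* INTERNAL */ &&
  error.message.includes('triggered by internal client error: read ECONNRESET');

// The idea is for RetryOptions to carry such a function and for createApiCall
// to consult it when deciding whether to retry, e.g. (illustrative only):
//   const retryOptions = new RetryOptions(retryCodes, backoffSettings, shouldRetryFn);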
I feel very strongly that this is a wrong solution for the problem. It will definitely put out the fire, and it will hide the real problem behind those 13 INTERNAL errors.
I don't want to block on this, and I believe the 13 INTERNAL issue affects not only the Node.js libraries, but other languages as well. I suggest discussing this with the gRPC team first (Cc: @murgatroid99).
I agree that this is the wrong solution to the problem. The right solution is for the gRPC library to report UNAVAILABLE instead of INTERNAL for these errors, if I can find a way to isolate the correct subset of these errors that are connection errors and not other kinds of internal errors.
"gRPC library to report UNAVAILABLE instead of INTERNAL for these errors,"
Isn't this a different error type? While "It is currently possible to retry all errors with INTERNAL code" is stated, I am not sure that is true; it is possible the backend had an error after altering state, for instance.
I meant "these errors" in a very narrow sense here, referring to the error messages ending with "triggered by internal client error: read ECONNRESET". Those indicate that the connection was dropped while waiting for a response, and UNAVAILABLE
is the appropriate code for that.
In fact, I just published @grpc/grpc-js version 1.2.9 with a speculative fix that might make that change I just described. If you update your dependencies, you can see if that works.
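If @grpc/grpc-js is only a transitive dependency (for example via google-gax), something like npm update @grpc/grpc-js or refreshing the lockfile should pull in 1.2.9, assuming the installed semver range allows it; if it is a direct dependency, bumping its version in package.json works as well.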
If that doesn't change anything, setting the environment variables GRPC_TRACE=call_stream and GRPC_VERBOSITY=DEBUG will output some logs, including one with details on the error from Node that triggers these errors. Those logs could help improve this solution.
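For example, when launching the process (with app.js standing in for your entry point):

GRPC_TRACE=call_stream GRPC_VERBOSITY=DEBUG node app.js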
Do we know that nothing else occurred while waiting for the response? We have traced issues returned under 13 RST_STREAM to an abort on the backend. In the most recent instance we found cases where it would be accurate to report UNAVAILABLE; I am just not sure how generally that could be applied.
If we can be sure that no change occurred as part of the request, making it clear it is safe for a user to retry would be very helpful.
No, you can't be sure about that here. This error happens when you lose the connection to the server. By definition, you don't know what the server did after that.
By "It is currently possible to retry all errors with INTERNAL code" I meant that we could just create client overriding some settings to retry INTERNAL errors.
// Override the default retry settings so that INTERNAL is also retried.
new ClusterManagerClient({
  clientConfig: {
    interfaces: {
      "google.container.v1.ClusterManager": {
        retry_codes: {
          non_idempotent: ["INTERNAL"],
          idempotent: [
            "INTERNAL",
            "DEADLINE_EXCEEDED",
            "UNAVAILABLE"
          ]
        }
      }
    }
  }
});
We don't want to do that.
Thank you for releasing the @grpc/grpc-js version changing these errors to UNAVAILABLE. I will check if that solves this issue.
There is one problem with these error codes that I hoped my approach would address: they cover very wide groups of errors. It would be great if we could decide whether to retry at a more granular level. I understand that the error message is not a reliable way to do that, but it is easier than implementing a code for every single error.
Can you share other error messages you are seeing that you think should be retried? If there is a pattern there, it may warrant further modifications to grpc-js.
My answer to that would be that I was not thinking of anything specific, but I wanted to add flexibility for the user to decide on a case-by-case basis. Adding such a function would, for example, allow the user to do some calculation to decide whether they can afford to retry or not. But let me check if there are any other generally retriable error messages.
@murgatroid99 It was the only error to my knowledge, but I have asked for that to be double-checked.
I get that there are two sides to this: long term, it's certainly best to export the error codes that matter so you can handle them where necessary, so when it comes to the "good solution" I agree with you.
On the other hand, we all know that often enough we need to be able to override where possible, as making those changes can take a long time and your production may be affected in the short term. In this sense, having the flexibility still makes sense, even if it could be abused by some. The old "with much power" problem, sadly.
I'll get back to you if there are indeed other errors involved. But even if we have not seen them, that wouldn't mean there aren't any others, sadly.