
Allow deciding whether to retry based on the error message

Open kramarz opened this issue 4 years ago • 10 comments

Is your feature request related to a problem? Please describe.
I am getting an error like:

{"name":"Error","message":"13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET","stack":"at Object.callErrorFromStatus (/usr/local/autotrader/app/node_modules/@grpc/grpc-js/build/src/call.js:31:26)"}

I want to be able to retry such errors, but not all INTERNAL errors.

Describe the solution you'd like
Allow specifying a custom shouldRetryFn function that could return true if the error message contains a certain phrase. This function would be part of the RetryOptions that you can pass as an override to createApiCall (see the sketch after the alternatives list below).

Describe alternatives you've considered

  • It is currently possible to retry all errors with the INTERNAL code.
  • Another solution would be adding a separate error code for connection errors to the grpc library, but that could break existing solutions for others relying on the INTERNAL code.
  • In streaming calls it is already possible to specify a similar shouldRetryFn function. Normal calls should also have such an option.
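A minimal sketch of the kind of predicate being requested, assuming the error object exposes the gRPC status code and message the way @grpc/grpc-js surfaces them. The RetryOptions/createApiCall wiring in the trailing comment is hypothetical, since that is exactly the option this issue asks google-gax to add for unary calls.

// Hypothetical predicate: retry only INTERNAL errors caused by a dropped connection.
const {status} = require('@grpc/grpc-js');

function shouldRetryFn(error) {
  return (
    error.code === status.INTERNAL &&
    /read ECONNRESET/.test(error.message || '')
  );
}

// Imagined usage (names are assumptions, not an existing google-gax API):
// const retryOptions = {retryCodes: [], backoffSettings, shouldRetryFn};
// const apiCall = createApiCall(func, {...callSettings, retry: retryOptions});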

kramarz avatar Feb 15 '21 13:02 kramarz

I feel very strongly that this is the wrong solution to the problem. It will definitely put out the fire, but it will hide the real problem behind those 13 INTERNAL errors.

I don't want to block on this, and I believe the 13 INTERNAL issue affects not only Node.js libraries but other languages as well. I suggest discussing this with the gRPC team first (Cc: @murgatroid99).

alexander-fenster avatar Mar 02 '21 09:03 alexander-fenster

I agree that this is the wrong solution to the problem. The right solution is for the gRPC library to report UNAVAILABLE instead of INTERNAL for these errors, if I can find a way to isolate the correct subset of these errors that are connection errors and not other kinds of internal errors.

murgatroid99 avatar Mar 02 '21 18:03 murgatroid99

"gRPC library to report UNAVAILABLE instead of INTERNAL for these errors,"

Isn't this a different error type? While "It is currently possible to retry all errors with INTERNAL code" is stated, I am not sure that is true; it is possible the backend had an error after altering state, for instance.

crwilcox avatar Mar 02 '21 22:03 crwilcox

I meant "these errors" in a very narrow sense here, referring to the error messages ending with "triggered by internal client error: read ECONNRESET". Those indicate that the connection was dropped while waiting for a response, and UNAVAILABLE is the appropriate code for that.

In fact, I just published @grpc/grpc-js version 1.2.9 with a speculative fix that might make the change I just described. If you update your dependencies, you can see if that works.

If that doesn't change anything, setting the environment variables GRPC_TRACE=call_stream and GRPC_VERBOSITY=DEBUG will output some logs including one with details on the error from Node that triggers these errors. Those logs could help improve this solution.

murgatroid99 avatar Mar 02 '21 22:03 murgatroid99

Do we know that nothing else occurred while waiting for the response? We have traced issues returned under 13 RST_STREAM back to an abort on the backend. In the most recent instance we found cases where it would be accurate to report UNAVAILABLE; I am just not sure how generally that could be applied.

If we can be sure that no change occurred as part of the request, making it clear it is safe for a user to retry would be very helpful.

crwilcox avatar Mar 02 '21 22:03 crwilcox

No, you can't be sure about that here. This error happens when you lose the connection to the server. By definition, you don't know what the server did after that.

murgatroid99 avatar Mar 03 '21 01:03 murgatroid99

By "It is currently possible to retry all errors with INTERNAL code" I meant that we could just create a client that overrides some settings to retry INTERNAL errors:

new ClusterManagerClient({clientConfig: {
  interfaces: {
    "google.container.v1.ClusterManager": {
      retry_codes: {
        // Treat INTERNAL as retryable for both idempotent and non-idempotent calls.
        non_idempotent: ["INTERNAL"],
        idempotent: [
          "INTERNAL",
          "DEADLINE_EXCEEDED",
          "UNAVAILABLE"
        ]
      }
    }
  }
}});

We don't want to do that.

Thank you for releasing the @grpc/grpc-js change that maps these errors to UNAVAILABLE; I will check whether it solves this issue.

There is one problem with these error codes that I hoped my approach would address: they cover very wide groups of errors. It would be great if we could decide whether to retry at a more granular level. I understand that the error message is not a reliable signal, but matching on it is just easier than implementing code for every single error.

kramarz avatar Mar 03 '21 08:03 kramarz

Can you share other error messages you are seeing that you think should be retried? If there is a pattern there, it may warrant further modifications to grpc-js.

murgatroid99 avatar Mar 03 '21 08:03 murgatroid99

My answer to that would be that I was not thinking of anything specific, but I wanted to add flexibility for the user to decide on a case-by-case basis. Adding such a function would, for example, allow the user to do some calculation to determine whether they can afford to retry or not (see the sketch below). But let me check if there are any other generally retriable error messages.
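As a rough illustration of that "can we afford to retry" idea, here is a hypothetical predicate factory that budgets retries by elapsed time; the deadlineMs parameter and the factory shape are assumptions for the sake of the example, not an existing google-gax API.

// Hypothetical: only allow a retry while at least half of the time budget remains,
// on the assumption that a late retry is unlikely to finish before the deadline.
function makeShouldRetryFn(deadlineMs) {
  const start = Date.now();
  return function shouldRetryFn(error) {
    const elapsed = Date.now() - start;
    // A real predicate could also inspect error.code or error.message here.
    return elapsed < deadlineMs / 2;
  };
}

// Example: with a 10 s budget, retries are only attempted within the first 5 s.
// const shouldRetryFn = makeShouldRetryFn(10000);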

kramarz avatar Mar 03 '21 09:03 kramarz

@murgatroid99 It was the only error to my knowledge, but I have asked for that to be double-checked.

I get that there are two sides to this: long term, it's certainly best to export the error codes that matter so you can handle them where necessary, so when it comes to the "good solution" I agree with you.

On the other hand, we all know that often enough we need to be able to override where possible, as making those changes can take a long time and production may be affected in the short term. In this sense, having the flexibility still makes sense, even if some could abuse it. The old "with great power" problem, sadly.

I'll get back to you if there are indeed other errors involved. But even if we haven't seen any others, that sadly wouldn't mean there aren't any.

sandrom avatar Mar 03 '21 12:03 sandrom