
Re-raise unrecognized exceptions on retry.

Open • lukesneeringer opened this issue 7 years ago • 7 comments

Right now, if our retry logic encounters an exception it does not recognize, it raises a RetryError that wraps the original exception.

This is probably not what we want: the RetryError ends up masking the actual exception that needs to be fixed. The developer can still get at it (by calling .cause.exception()), but that is neither obvious nor documented anywhere, and there is no intuitive reason to try it.
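For illustration, digging the real exception out today looks something like this (RetryError and its .cause follow the description above; some_gapic_call is a hypothetical stand-in):

```python
# Sketch only: assumes the RetryError/.cause API described above.
from google.gax import errors

def call_api(some_gapic_call):
    try:
        return some_gapic_call()
    except errors.RetryError as exc:
        # The wrapper hides the real failure; .cause.exception() is the
        # only way back to it, and nothing suggests trying it.
        original = exc.cause.exception()
        raise original
```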

I assert that it would be better to just raise the original exception, in its original context, and reserve RetryError for errors whose nature the retry logic fundamentally understands (e.g. hitting the max retry limit).
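Roughly, the behavior I am proposing (illustrative names, not the actual gax internals; the local RetryError stands in for google.gax.errors.RetryError):

```python
import time

class RetryError(Exception):
    """Stand-in for google.gax.errors.RetryError."""

def retryable(func, codes_to_retry, deadline):
    """Retry func on recognized codes; re-raise anything unrecognized."""
    while time.time() < deadline:
        try:
            return func()
        except Exception as exc:
            code = exc.code() if hasattr(exc, 'code') else None
            if code not in codes_to_retry:
                # Unrecognized: surface the exception in its original
                # context instead of wrapping it.
                raise
            time.sleep(0.1)  # stand-in for real exponential backoff
    # RetryError is reserved for failures the retry logic itself
    # understands, e.g. exhausting the deadline.
    raise RetryError('Retry deadline exceeded.')
```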

If accepted, this PR:

  • Fixes GoogleCloudPlatform/google-cloud-python#3083
  • Closes GoogleCloudPlatform/google-cloud-python#3087 (as no longer necessary)

lukesneeringer • Mar 02 '17 17:03

FWIW, I think the "right" solution to this problem is to

  • make sure developers know of the existence of GaxError.cause
  • Intervene at the handwritten google-cloud-python level in cases where we don't want users to encounter a GaxError.

geigerj • Mar 02 '17 18:03

@geigerj Good points, all. :-) I am not sure I am convinced. I need to figure out the right thing to say to https://github.com/GoogleCloudPlatform/python-docs-samples/issues/833, so let's work out what that is. My best guess is that my change might not be where we end up, but the status quo is probably wrong too.

Simple users of the GAPIC should not need to know that the gRPC layer exists. Therefore, they should never encounter gRPC errors directly. They should be able to assert that if an RPC error occurs while calling a GAPIC method, it will be a GaxError.

My gut tells me that this requirement is correct, at least as a design principle. It also has the advantage that, if you want to, you can write except GaxError and expect it to catch most or all failures.

I am not as sure that it is correct in "unknown" cases (which is what this is). The fundamental exception here is something that our code was not ready for and literally caught with except Exception. So the question is: which is better, to send up a "wrapped" version of the exception (with a non-obvious API), or to send up the unrecognized exception to the caller?

I could see a pretty strong case that if we get gRPC errors that we should be able to handle but whose code is unrecognized, we treat those as bugs that need to be wrapped in a GaxError and given an appropriate, helpful message.

Bottom line: I agree with you that we should avoid sending up transport-specific errors as a general practice. But, I am less sure that handling unknown cases with a sword this broad is wise.

I am intentionally inviting debate / disagreement here. Tell me I am wrong. :-)

(Meanwhile, advanced users can still access the underlying gRPC error via the attributes of the GaxError.)

Do we document how to do this? (If not, where is the right place where it would be found? Our docs are kind of all over the place right now.)

One should be able to configure GAX with any transport layer without changing its behavior. For example, if we implemented GAX support for HTTP/JSON, an RPC error that occurs during a retry should still be a GaxError.

Again, this sounds right as a design principle; I am unsure it should apply to unknown cases.

The impact of both of these is that gRPC errors are never raised from GAX, since ApiCallables are always wrapped to ensure they raise GaxErrors when an exception occurs. It would be inconsistent to raise the underlying error only in the retry case.

I think you are wrong here. As best I can read this code, to_catch is a whitelist. It will not catch any old error; it will only catch the errors it knows how to deal with. The proposed change would make retry errors act more like this code, would it not?
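For comparison, whitelist-style handling in that spirit might look like the following (the contents of TO_CATCH here are illustrative, not the actual to_catch tuple):

```python
import grpc

# Only the codes this layer knows how to deal with (illustrative).
TO_CATCH = (grpc.StatusCode.UNAVAILABLE,
            grpc.StatusCode.DEADLINE_EXCEEDED)

def guarded_call(stub_method, request):
    try:
        return stub_method(request)
    except grpc.RpcError as exc:
        if exc.code() not in TO_CATCH:
            raise  # not on the whitelist: propagate untouched
        # ... handle or retry the recognized codes here ...
```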

lukesneeringer • Mar 02 '17 18:03

Update: If we go with something like my strategy here, I am fairly certain that we would still consider https://github.com/GoogleCloudPlatform/python-docs-samples/issues/833 to be a bug; the bug would be that we are not appropriately catching the code being sent back by the Vision API.

lukesneeringer • Mar 02 '17 18:03

@jonparrott Could you add thoughts on this question? (Social note: I explicitly confirmed with Jacob that adding another voice here would be instructive.)

lukesneeringer • Mar 02 '17 18:03

Summary of offline discussion with @lukesneeringer:

  • We should not wrap all exceptions without any knowledge of what we're wrapping.
  • We should aim to preserve the original design intention of raising GaxError in general to indicate API errors, regardless of the transport layer.
  • There is already a mechanism for doing this, used by api_callable and configured by config.API_ERRORS.
  • We can reuse this mechanism here to wrap API errors, so that users don't have to worry about transport when writing exception-handling code, without wrapping all exceptions (see the sketch below).
  • We should include the string representation of the wrapped exception in the GaxError message so that it is recorded when users log their exceptions.
  • If the documentation of GaxError.cause isn't sufficient or visible enough, improve it.
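A minimal sketch of that approach, assuming the config.API_ERRORS tuple and the GaxError(msg, cause=...) convention described in this thread:

```python
from google.gax import config, errors

def wrap_api_errors(func):
    """Wrap known API errors in GaxError; let everything else bubble up."""
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except config.API_ERRORS as exc:
            # Embed str(exc) so the original message is recorded even
            # when users log the GaxError without inspecting .cause.
            raise errors.GaxError('RPC failed: {}'.format(exc), cause=exc)
        # KeyboardInterrupt, OSError, etc. propagate unmodified.
    return inner
```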

geigerj • Mar 02 '17 18:03

I agree with where you guys ended up; let me reiterate:

  1. Known (non-exceptional) transport exceptions should be wrapped.
  2. Unknown (exceptional) exceptions, such as KeyboardInterrupt, OSError, etc., should bubble up as-is.
  3. GaxError.cause should be better documented.
  4. We should be aware that exception chaining in Python 3 will help things, and make sure our code doesn't obscure that (see the sketch after this list).
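A quick illustration of point 4, with a stand-in GaxError (raise ... from attaches the original traceback to the new exception):

```python
class GaxError(Exception):
    """Stand-in for the real google.gax.errors.GaxError."""

def fetch():
    raise ConnectionError('transport blew up')

try:
    fetch()
except ConnectionError as exc:
    # Python 3 prints both tracebacks, joined by "The above exception
    # was the direct cause of the following exception".
    raise GaxError('RPC failed: {}'.format(exc)) from exc
```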

As an aside, I super hate GaxError just on name alone. If I'm using google.cloud.something, I expect custom exceptions to be derived from something in google.cloud, maybe google.cloud.exceptions.GoogleCloudError. This is one of the many reasons I'd like to see gax and google-cloud-core converge; gax is a really unfortunate name for usability.

theacodes • Mar 02 '17 19:03

@geigerj @jonparrott Okay, this change has gotten more complex, but I think it is better for your input. That said, I want to run this through a little more real-world testing before merging, since I had to modify the unit tests pretty drastically.

Assigned to @geigerj for review, but I want to justify a couple of things in advance (and, of course, pushback is welcome if you think I am wrong):

(1) I went from mocking exc_to_code to creating an RpcError subclass which accepts a code and works with exc_to_code correctly. The reason for this was to make it a little easier to reason about how our cases have changed: we now recognize that we might get RpcError instances (which have a .code() method) and unrelated exceptions (which do not). I am open to alternative implementations, but this one seems easiest to grok.
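Roughly what such a test helper could look like (the names here are mine, not necessarily the PR's):

```python
import grpc

class FakeRpcError(grpc.RpcError):
    """An RpcError carrying an explicit status code, so exc_to_code can
    resolve it the same way it would a real transport error."""

    def __init__(self, code):
        super(FakeRpcError, self).__init__()
        self._code = code

    def code(self):
        return self._code

err = FakeRpcError(grpc.StatusCode.UNAVAILABLE)
assert err.code() is grpc.StatusCode.UNAVAILABLE
```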

(2) I changed every mock_call variable in test_retry.py to mock_func. This was to get past something I found personally confusing: MagicMock has a mock_calls attribute that I use often, and it was confusing that mock_call was not actually a mock.Call.
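The confusion, concretely (standard unittest.mock behavior):

```python
from unittest import mock

mock_func = mock.MagicMock(return_value=42)
mock_func(1, 2)

# mock_calls is an attribute recording every invocation; a variable
# named mock_call sitting next to it read like a single mock.call.
assert mock_func.mock_calls == [mock.call(1, 2)]
```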

(3) CustomException is now used for non-RPC exceptions, i.e. ones that should bubble up.

lukesneeringer • Mar 02 '17 21:03