gax-python
Re-raise unrecognized exceptions on retry.
Right now, if our retry logic encounters an exception it does not recognize, it raises `RetryError`, which wraps the original exception.
This is probably not what we want. The end result is that the `RetryError` masks the actual exception that needs to be fixed. The developer can still get at it (by calling `.cause.exception()`), but that is not clear or documented anywhere, and there is no intuitive reason to try it.
I assert that it would be better to just raise the original exception, in the original context, and reserve `RetryError` for errors whose nature the retry logic fundamentally understands (e.g. hitting the max retry limit).
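To make the proposal concrete, here is a minimal sketch of the intended behavior. This is not the actual gax retry code; `with_retry`, `retry_codes`, and `deadline` are simplified stand-ins:

```python
import random
import time


class RetryError(Exception):
    """Raised only when the retry logic itself gives up (e.g. deadline hit)."""


def with_retry(func, retry_codes, deadline):
    """Call ``func``, retrying only errors whose code is in ``retry_codes``."""
    give_up_at = time.time() + deadline
    while True:
        try:
            return func()
        except Exception as exc:
            code = exc.code() if hasattr(exc, 'code') else None
            if code not in retry_codes:
                raise  # Proposed behavior: re-raise unrecognized errors as-is.
            if time.time() >= give_up_at:
                # The retry logic fundamentally understands this failure
                # mode, so RetryError is still appropriate here.
                raise RetryError('Deadline of {}s exceeded'.format(deadline))
            time.sleep(random.uniform(0.0, 1.0))
```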
If accepted, this PR:
- Fixes GoogleCloudPlatform/google-cloud-python#3083
- Closes GoogleCloudPlatform/google-cloud-python#3087 (as no longer necessary)
FWIW, I think the "right" solution to this problem is to:
- make sure developers know of the existence of `GaxError.cause`
- intervene at the handwritten google-cloud-python level in cases where we don't want users to encounter a `GaxError`
@geigerj Good points, all. :-) I am not sure I am convinced. I need to figure out the right thing to say to https://github.com/GoogleCloudPlatform/python-docs-samples/issues/833, so let's figure out what that is. My best guess is that my change might not be where we end up, but the status quo is probably wrong also.
Simple users of the GAPIC should not need to know that the gRPC layer exists. Therefore, they should never encounter gRPC errors directly. They should be able to assert that if an RPC error occurs while calling a GAPIC method, it will be a `GaxError`.
My gut tells me that this requirement is correct, at least as a design principle. This also has the advantage that, if you want to, you can `except GaxError` and expect to catch most/all things.
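For illustration, user code under this principle might look like the following. This is a hypothetical sketch: `call_gapic_method` is made up, and it assumes `GaxError` carries a `cause`, as in `google.gax.errors`:

```python
from google.gax.errors import GaxError


def call_gapic_method():
    # Stand-in for a generated GAPIC method that wraps transport errors.
    raise GaxError('RPC failed', cause=ValueError('transport detail'))


try:
    call_gapic_method()
except GaxError as exc:
    # Simple users only ever need this one except clause; advanced users
    # can still dig out the transport-level error via the cause attribute.
    print('underlying cause: {!r}'.format(exc.cause))
```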
I am not as sure that it is correct in "unknown" cases (which is what this is). The fundamental exception here is something that our code was not ready for and literally caught with `except Exception`. So the question is: which is better, to send up a "wrapped" version of the exception (with a non-obvious API), or to send up the unrecognized exception to the caller?
I could see a pretty strong case that if we get gRPC errors that we should be able to handle but the code is unrecognized, we treat those as bugs that need to be wrapped in a `GaxError` and given an appropriate/helpful error message.
Bottom line: I agree with you that we should avoid sending up transport-specific errors as a general practice. But, I am less sure that handling unknown cases with a sword this broad is wise.
I am intentionally inviting debate / disagreement here. Tell me I am wrong. :-)
(Meanwhile, advanced users can still access the underlying gRPC error via the attributes of the GaxError.)
Do we document how to do this? (If not, where is the right place where it would be found? Our docs are kind of all over the place right now.)
One should be able to configure any transport layer of GAX without changing its behavior. For example, if we implemented GAX support for HTTP/JSON, an RPC error that occurs during retrying should still be a `GaxError`.
Again, this sounds right as a design principle; I am unsure it should apply to unknown cases.
The impact of both of these is that gRPC errors are never raised from GAX, since ApiCallables are wrapped to ensure they raise GaxErrors whenever an exception occurs. It is inconsistent to return the underlying error only in the retry case.
I think you are wrong here. As best as I can read this code, `to_catch` is a whitelist. It will not catch any old error; it will only catch the errors it knows how to deal with. The proposed change would make retry errors act more like this code, would it not?
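A sketch of the whitelist idea, assuming gRPC status codes (the real `to_catch` logic lives in the gax retry code and differs in detail; `do_rpc` is a made-up callable):

```python
from grpc import StatusCode

RETRYABLE_CODES = frozenset([StatusCode.UNAVAILABLE,
                             StatusCode.DEADLINE_EXCEEDED])


def is_retryable(exc):
    # Only errors that expose a recognized status code are whitelisted.
    code = exc.code() if hasattr(exc, 'code') else None
    return code in RETRYABLE_CODES


def call_with_whitelist(do_rpc):
    try:
        return do_rpc()
    except Exception as exc:
        if not is_retryable(exc):
            raise  # Unrecognized: bubble up in its original context.
        # ...otherwise fall through to the backoff/retry machinery.
```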
Update: If we go with something like my strategy here, I am fairly certain that we would still consider https://github.com/GoogleCloudPlatform/python-docs-samples/issues/833 to be a bug; the bug would be that we are not appropriately catching the code being sent back by the Vision API.
@jonparrott Could you add thoughts on this question? (Social note: I explicitly confirmed with Jacob that adding another voice here would be instructive.)
Summary of offline discussion with @lukesneeringer:
- We should not wrap all exceptions without any knowledge of what we're wrapping here.
- We should aim to preserve the original design intention of raising `GaxError` in general to indicate API errors, regardless of the transport layer
- There is already a mechanism for doing this, used by `api_callable` and configured by `config.API_ERRORS`
- We can reuse this mechanism here to ensure that we are wrapping API errors, so that users don't have to worry about transport when writing exception-handling code, without wrapping all exceptions (a sketch follows this list)
- We should include the string representation of the wrapped exception in the `GaxError` message so that it is recorded when users log their exceptions
- If the documentation of `GaxError.cause` isn't sufficient or visible enough, improve it
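A rough sketch of what that could look like, assuming `config.API_ERRORS` is a tuple of transport exception types as the gax `config` module provides, and that `GaxError` accepts a `cause` as in `google.gax.errors`:

```python
from google.gax import config
from google.gax.errors import GaxError


def _wrap_api_errors(func, *args, **kwargs):
    try:
        return func(*args, **kwargs)
    except config.API_ERRORS as exc:
        # Embed the wrapped exception's text in the message so it still
        # shows up when users log only the GaxError itself.
        raise GaxError('RPC failed: {}'.format(exc), cause=exc)
    # Anything else (KeyboardInterrupt, OSError, ...) bubbles up as-is.
```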
I agree with where you guys ended up; let me reiterate:
- Known (non-exceptional) transport exceptions should be wrapped.
- Unknown (exceptional) exceptions, such as KeyboardInterrupt, OSError, etc., should bubble up as-is.
- `GaxError.cause` should be better documented.
- We should be aware that exception chaining in Python 3 will help things, and make sure our code doesn't obscure that (see the sketch after this list).
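On that last point, a minimal Python 3 illustration (the exception classes here are made up, not gax APIs):

```python
class TransportError(Exception):
    pass


class WrappedError(Exception):
    pass


def call():
    raise TransportError('UNAVAILABLE')


try:
    call()
except TransportError as exc:
    # "raise ... from exc" sets __cause__, so the traceback shows both
    # exceptions; re-raising without "from" (or swallowing exc entirely)
    # would obscure the original failure.
    raise WrappedError('RPC failed') from exc
```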
As an aside, I super hate `GaxError` just on name alone. I feel like if I'm using `google.cloud.something`, I expect custom exceptions to be derived from something in `google.cloud`, maybe `google.cloud.exceptions.GoogleCloudError`. This is one of the many reasons I'd like to see gax and google-cloud-core converge; gax is a really unfortunate name for usability.
@geigerj @jonparrott Okay, this change has gotten more complex, but I think it is the better for your input. That said, I want to run this through a little more real-world testing before merging, since I had to modify the unit tests pretty drastically.
Assigned to @geigerj for review, but I want to justify a couple things in advance (and, of course, pushback is welcome if you think I am wrong):
(1) I went from mocking `exc_to_code` to creating an `RpcError` subclass which accepts a `code` and handles `exc_to_code` correctly. The reason for this was to make it a little easier to reason about how our cases have changed: we now recognize that we might get `RpcError` instances (which have a `.code()` method) and unrelated exceptions (which do not). I am up for considering alternative implementations, but this one seems easiest to grok.
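For reference, the test helper described here might look roughly like this. This is a sketch under the assumption that `exc_to_code` resolves the status by calling `.code()` on `grpc.RpcError` instances; the class name is illustrative, not the one in the PR:

```python
import grpc


class FakeRpcError(grpc.RpcError):
    """An RpcError subclass that reports a fixed status code, so that
    exc_to_code() can resolve it without any mocking."""

    def __init__(self, code):
        super(FakeRpcError, self).__init__()
        self._code = code

    def code(self):
        return self._code
```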
(2) I changed every `mock_call` variable in `test_retry.py` to `mock_func`. This was to get past something that I found personally confusing: `MagicMock` has a `mock_calls` attribute that I use often, and it was confusing to me that `mock_call` was not actually a `mock.Call`.
(3) `CustomException` is now used for non-RPC exceptions, i.e. ones that should bubble up.