connect-go icon indicating copy to clipboard operation
connect-go copied to clipboard

Restrict status codes raised by the library

Open emcfarlane opened this issue 2 years ago • 9 comments
trafficstars

Connect-go raises error codes that fall outside what the gRPC spec recommends for a framework: https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

The following status codes are never generated by the library:

  • INVALID_ARGUMENT
  • NOT_FOUND
  • ALREADY_EXISTS
  • FAILED_PRECONDITION
  • ABORTED
  • OUT_OF_RANGE
  • DATA_LOSS

INVALID_ARGUMENT and FAILED_PRECONDITION from a quick grep. I think most of these cases should be changed to Internal errors.

emcfarlane avatar Sep 20 '23 19:09 emcfarlane

Connect returns INVALID_ARGUMENT in some cases. FAILED_PRECONDITION is only used in a test.

I'm very reluctantly okay with changing our use of INVALID_ARGUMENT to INTERNAL and UNAVAILABLE, matching gRPC. It does seem deeply odd to raise the equivalent of a 5xx when the client sends the server garbage data, though. (And this makes the Connect protocol return 404s if the client uses a compression mechanism that the server doesn't support, which is also quite odd.)

akshayjshah avatar Sep 25 '23 04:09 akshayjshah

Clarifying the server errors codes should help users debug errors. Any INVALID_ARGUMENT will come from user code, ie. an invalid field. Mangled payloads I think should be server errors as this is a bug in the library or some other internal issue.

emcfarlane avatar Oct 03 '23 16:10 emcfarlane

Mangled payloads can just as easily come from the client's RPC runtime. I've never seen a REST server treat a mangled JSON request payload as a 5xx - it seems pretty standard to me to assume that mangled payloads are the client's fault (or some middlebox's fault).

akshayjshah avatar Oct 03 '23 21:10 akshayjshah

I've never seen a REST server treat a mangled JSON request payload as a 5xx

Anything but a 400 is unexpected in my book. The gRPC protocol does not use HTTP status codes and is rarely used with JSON. Are we sure we want to forego semantic HTTP status codes for the Connect implementation because of gRPC?

It would be really nice if there was a set of error codes reserved for users. It would help with debugging and it would let a user-facing application decide which error messages to show to the user verbatim. I don't think that restricting codes raised by the library gets us there though: It leaves just 7 codes that are guaranteed to originate from the server application, and it doesn't give any guarantees about the inverse (since a server application can raise any code).

So I wonder if what gRPC does is actually very helpful in practice.

timostamm avatar Oct 04 '23 13:10 timostamm

Ideally it would be a 400 HTTP status but not INVALID_ARGUMENT error code as I do think that implies a user issue not a marshalling error.

If both client and server generate the request a 500 seems reasonable to me as this is a issue with the client/server lib, but yep this isn't RESTful.

emcfarlane avatar Oct 04 '23 17:10 emcfarlane

@jhump and @mattrobenolt Curious about your thoughts on this one.

akshayjshah avatar Oct 12 '23 18:10 akshayjshah

To be explicit, if this is referring specifically to the codes we spit back for the gRPC protocol, and not the connect protocol, I think we should mirror the behavior of grpc-go and what the spec says.

I haven't gone through and audited each use of them and compared to what the spec says, but I think it'd make sense to try and be as compatible to expectations.

Granted, the only real downside I see is this might be considered an API breaking change? But I don't feel that strongly about it.

As a quick run through our/my uses, we don't do any checks that particularly compare the status codes. Anything not-OK is usually treated the same, and typically it's either pass/fail, with the code and status as just effectively metadata.

mattrobenolt avatar Oct 12 '23 18:10 mattrobenolt

I agree with Timo, that anything other than a 400 is bad for the Connect protocol.

But, TBH, we should perhaps make an appeal to the gRPC team to change the spec, because using "internal" for gRPC error codes is also bad for the same reason: it makes observability and alerting a nightmare because you can no longer correctly distinguish between "errors caused by a bad client" and "errors caused by my server".

I've never worked at a place where dev ops didn't distinguish these cases for alerting and exception-reporting config. So having a client-induced error reported as "internal" is bad. With HTTP/REST, you'd generally distinguish by classifying status codes as 4xx vs 5xx. With gRPC, the best you can do is to statically classify each gRPC error code as either "client" or "server", and this breaks down when a code like "internal" can mean either.

jhump avatar Oct 13 '23 12:10 jhump

Interestingly gRPC clients that receive a 400 response from a service map to INTERNAL as per: https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md I'd argue UNKNOWN codes are more severe then INTERNAL in most cases.

OTEL recently adjusted the gRPC status codes to map closer to HTTP semantics: https://github.com/open-telemetry/opentelemetry-specification/pull/3333 From the PR:

  • on invalid grpc_timeout returns 400 and a codes.Internal error: https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/handler_server.go#L90-L92

emcfarlane avatar Oct 26 '23 18:10 emcfarlane

This is fixed with the conformance tests and the spec update for error codes.

emcfarlane avatar Aug 07 '24 15:08 emcfarlane