grpc icon indicating copy to clipboard operation
grpc copied to clipboard

Clarification on Handling Raised RPCErrors as Crash Events

Open ksanderer opened this issue 6 months ago • 7 comments

Hello Team!

Firstly, thank you for the recent update and the introduction of the crash report logger metadata. This feature is indeed beneficial for debugging and monitoring purposes (related PR and Issue: #381, #377)

However, I have a concern, or perhaps I have misunderstood something regarding the current implementation where all raised errors are considered crash events. Based on the logs, it appears that even standard gRPC errors, such as bad_request or not_found, are treated as crash events. Here is an example:

13:12:02.786 [error] ** (GRPC.Server.Adapters.ReportException) Exception raised while handling /api.messages.v1.MessageService/UpdateMessage:
** (GRPC.RPCError) Message with provided id doesn’t exist.
...

%GRPC.Server.Adapters.ReportException{
    kind: :error, 
    reason: %GRPC.RPCError{
        status: 5,  # GRPC.Status.not_found()
        message: \"Message with provided id doesn’t exist.\"}, 
    ...
}

These are "normal" gRPC errors supported by the protocol, indicating statuses like bad_request or not_found. Treating these as crash events might lead to unnecessary noise in the logs and potentially obscure actual issues that need attention.

By "normal," I refer to the Google APIs guidelines:

Because most Google APIs use resource-oriented API design, the error handling follows the same design principle by using a small set of standard errors with a large number of resources. For example, instead of defining different kinds of "not found" errors, the server uses one standard google.rpc.Code.NOT_FOUND error code and tells the client which specific resource was not found.

Google Cloud API Error Model

This also applies to errors in "Standard methods":

If the Create method supports client-assigned resource name and the resource already exists, the request should either fail with error code ALREADY_EXISTS or use a different server-assigned resource name and the documentation should be clear that the created resource name may be different from that passed in.

Standard methods - Create

According to the documentation, the only way to propagate google.rpc.Status is by raising a GRPC.RPCError.

At my workplace, we adhere to the approach outlined in the documentation, as demonstrated below:

  def delete_message(%DeleteMessageRequest{organization_id: ""}, _stream) do
    raise GRPC.RPCError,
      status: GRPC.Status.invalid_argument(),
      message: "`organization_id` is required"
  end

Questions:

  1. Rationale: Could you please clarify the rationale behind treating all errors as crash events?
  2. Strategy: What would be the proper strategy to handle these errors or suppress these crash reports for standard gRPC errors?

Looking forward to your insights on this matter. Thank you for your time and assistance.

ksanderer avatar Aug 08 '24 15:08 ksanderer