
Improve logging experience

bnjjj opened this issue on Sep 20, 2022 · 5 comments

Debugging the router, or figuring out why it returned an error, is often painful and time-consuming. For example, if you have to debug an error reported by the router and just want to know what request was sent to a subgraph and what raw response came back, today you have to add logs directly in the router codebase.

Another example: sometimes we put an error in the GraphQL errors array with little context, and cannot find the equivalent error, with more context, in our logs.

What I would like is a simple mechanism to find the logs related to a GraphQL error just by grepping our logs. We could generate a unique identifier for each error and expose that ID in extensions.

Example:

{
  "errors": [
    {
      "message": "HTTP fetch failed from 'inventory': service 'inventory' response was malformed: service 'inventory' response was malformed: invalid type: string \"/\", expected a sequence",
      "path": [
        "allProducts",
        "@"
      ],
      "extensions": {
        "error_id": "123e4567-e89b-12d3-a456-426614174000", // This error code can be used to grep into our logs
        "type": "SubrequestHttpError",
        "service": "inventory",
        "reason": "service 'inventory' response was malformed: service 'inventory' response was malformed: invalid type: string \"/\", expected a sequence"
      }
    }
  ]
}
2022-09-20T09:11:22.481856Z ERROR an error occurred due to ... error_id="123e4567-e89b-12d3-a456-426614174000"

Then in our logs we can just grep for error_id="123e4567-e89b-12d3-a456-426614174000".
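
As a hypothetical sketch of how the two sides could be wired together (the helper name and types are made up; this assumes the uuid, serde_json, and tracing crates rather than the router's actual error types):

use uuid::Uuid;

// Hypothetical helper: attach a freshly generated error_id to a GraphQL
// error's extensions and emit a log record carrying the same ID, so the
// response and the log line can be correlated with grep or jq.
fn tag_error(extensions: &mut serde_json::Map<String, serde_json::Value>, message: &str) {
    let error_id = Uuid::new_v4().to_string();
    // The same ID goes into the response extensions...
    extensions.insert(
        "error_id".to_string(),
        serde_json::Value::String(error_id.clone()),
    );
    // ...and into the structured log record.
    tracing::error!(error_id = %error_id, "an error occurred: {message}");
}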

We could also add a logging section under telemetry to configure whether to log requests and raw responses from subgraphs, and whether to apply redaction so we don't leak any sensitive information.

telemetry:
  logging:
    log_on_graphql_error: true # Display a log when we return a GraphQL error
    subgraphs:
      all:
        log_request: false
        log_response: false
        redact:
          - match: secret*
            to: REDACTED
      my_subgraph:
        log_request: true
        log_response: true
        level: trace
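
As a rough illustration of how a redact rule like match: secret* could be applied to a payload before it is logged (hypothetical code, assuming serde_json and a simple trailing-* glob; a real implementation might use a proper glob or regex crate):

// Walk a JSON payload and replace the value of any field whose name
// matches the pattern, recursing into nested objects and arrays.
fn redact(value: &mut serde_json::Value, pattern: &str, replacement: &str) {
    match value {
        serde_json::Value::Object(map) => {
            for (key, v) in map.iter_mut() {
                let matched = match pattern.strip_suffix('*') {
                    Some(prefix) => key.starts_with(prefix),
                    None => key.as_str() == pattern,
                };
                if matched {
                    *v = serde_json::Value::String(replacement.to_string());
                } else {
                    redact(v, pattern, replacement);
                }
            }
        }
        serde_json::Value::Array(items) => {
            for item in items {
                redact(item, pattern, replacement);
            }
        }
        _ => {}
    }
}

Applied to a subgraph response body, this would turn {"secret_token": "abc"} into {"secret_token": "REDACTED"} before the payload hits the logs.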

bnjjj · Sep 20 '22 09:09

This is great! A few things come to mind immediately:

  1. We have enough other logging issues that I created a component/logging label for it. 🏷️
  2. I don't know that we want to couple telemetry and logging together, even if it is true that logs are attached to telemetry spans (which is very nice!). I suspect we might want to build a top-level concept around logging in our configuration. 🪵
  3. Perhaps in a similar vein, we might want to have a top-level concept around data redaction, which could cover various avenues in a consistent way: telemetry, logging, studio, etc. 💼

abernix · Sep 20 '22 09:09

Oh, could the title of this issue be a bit more specific? Maybe "subgraph request/response logging"?

abernix · Sep 20 '22 09:09

It's not specific to subgraph request/response logging; it's more global, but yes, in this description I was mainly focused on that topic. For example, the log_on_graphql_error field is not subgraph-related, and the error ID is a global thing too. I think it's still OK not to mention subgraphs in the title.

bnjjj · Sep 20 '22 09:09

Ok, fair! Let's leave it how it is for now and then consider breaking this into individual issues in the future.

(In the general sense "Improve logging" is, I think, too generic and I wouldn't want this to turn into an issue we can't concretely close just because it touches on too many things.)

abernix · Sep 20 '22 09:09

The error_id is an obvious winner. With JSON output, this is arguably better than grep, since:

router <blah> | jq '.errors[] | select(.extensions.error_id == "123e4567-e89b-12d3-a456-426614174000")'

will pick out every error with that exact ID.

garypen · Sep 20 '22 15:09

It would also be great to support other output targets for logging. Currently everything goes directly to stdout, which is not a viable option for us since we are running a multi-tenant k8s cluster. We decided to pipe the logs to a file, but that doesn't take care of file rotation and so on.
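
For what it's worth, in the Rust ecosystem something like the tracing-appender crate can handle the rotation side; a sketch only, not something the router exposes today, and the log directory and file name here are made up:

use tracing_appender::rolling;
use tracing_subscriber::fmt;

fn main() {
    // Write logs to a daily-rotated file instead of stdout.
    let file_appender = rolling::daily("/var/log/router", "router.log");
    // The guard must stay alive for buffered log lines to be flushed.
    let (non_blocking, _guard) = tracing_appender::non_blocking(file_appender);

    fmt().with_writer(non_blocking).init();

    tracing::info!("router logs now rotate daily");
}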

piclemx · Jun 16 '23 18:06

Going to close this in favor of #3226

BrynCooke · Jun 20 '23 10:06