router
router copied to clipboard
Improve logging experience
When you need to debug the router or find why you received an error from the router it's often a pain and can take time to discover what's going on. For example if you have to debug an error reported by the router and just wants to know what was the request sent and raw response received by a subgraph, for now it needs to add logs directly in the router codebase.
Another example, sometimes we throw an error in the graphql errors array with less context and cannot find the equivalent error with more context in our logs.
What I would like is to have a simple mechanism to find related logs to a GraphQL error just by grepping our logs. We could add a unique identifier generated for a specific error and provide this ID in extensions
Example:
{
"errors": [
{
"message": "HTTP fetch failed from 'inventory': service 'inventory' response was malformed: service 'inventory' response was malformed: invalid type: string \"/\", expected a sequence",
"path": [
"allProducts",
"@"
],
"extensions": {
"error_id": "123e4567-e89b-12d3-a456-426614174000", // This error code can be used to grep into our logs
"type": "SubrequestHttpError",
"service": "inventory",
"reason": "service 'inventory' response was malformed: service 'inventory' response was malformed: invalid type: string \"/\", expected a sequence"
}
}
]
}
2022-09-20T09:11:22.481856Z ERROR an error occured due to ... error_id="123e4567-e89b-12d3-a456-426614174000"
in logs error_id="123e4567-e89b-12d3-a456-426614174000"
Also we could add a logging section in telemetry to configure if we want to log requests and raw responses from subgraphs, if we want to add some redacting to not leak any sensible information.
telemetry:
logging:
log_on_graphql_error: true # Display a log when we return a GraphQL error
subgraphs:
all:
log_request: false
log_response: false
redact:
- match: secret*
to: REDACTED
my_subgraph:
log_request: true
log_response: true
level: trace
This is great! A few things come to mind immediately:
- We have enough other logging issues that I created a component/logging label for it. 🏷️
- I don't know that we want to couple telemetry and logging coupled together, even if it is true that logs are attached to telemetry spans (which is very nice!). I suspect we might want to build a top-level concept around
loggingin our configuration. 🪵 - Perhaps in a similar vein, we might want to have a top-level concept around data redaction, which could cover various avenues in a consistent way: telemetry, logging, studio, etc. 💼
Oh, could the title of this issue be a bit more specific?: maybe "subgraph request/response logging"?
It's not specific to subgraph request/response logging. It's more global, but yes indeed in this description I was mainly focus on that topic. But for example the field log_on_graphql_error is not related, also the error id is a global thing. I think it's still ok to not specify subgraph in the title.
Ok, fair! Let's leave it how it is for now and then consider breaking this into individual issues in the future.
(In the general sense "Improve logging" is, I think, too generic and I wouldn't want this to turn into an issue we can't concretely close just because it touches on too many things.)
The error_id is an obvious winner. With JSON, arguably better than grep since:
router <blah> | jq '.errors[] | select ( .extensions.error_id == "123e4567-e89b-12d3-a456-426614174000")'
will filter out all the errors with that exact ID.
Will be great also to add other patterns for logging. Currently this goes directly inside stdout and for us that's not a viable option currently since we are using multi-tenant k8s cluster. We decided to pipe the log to a file but that doesn't take care of the file rotation and everything.
Going to close this in favor of #3226