router icon indicating copy to clipboard operation
router copied to clipboard

Router produces 2 errors per request when subgraph response fails and content-type is not application/json

Open nathanmarcos opened this issue 1 year ago • 8 comments

Describe the bug Hello, I'm not sure if this was changed intentionally or it is a regression.

Given a subgraph that responds 5xx with content-type: text/plain, router produces the following error:

  • On version v1.23.0:
    • HTTP fetch failed from 'products': 503: Service Unavailable

And the following errors:

  • From version v1.24.0 until v1.26.0:
    • HTTP fetch failed from 'products': 503: Service Unavailable
    • HTTP fetch failed from 'products': subgraph didn't return JSON (expected content-type: application/json or content-type: application/graphql-response+json; found content-type: text/plain; charset=utf-8

When the subgraph is unavailable or the request times out due to a circuit breaker, we cannot control its response content type to always be application/json or its response body to contain a valid json with the data/errors attributes. Therefore, generating these two types of errors seems like a waste of resources since they do not add any extra value that helps us debug the issue.

To Reproduce Steps to reproduce the behavior:

  1. Setup router
  2. Set up a subgraph that responds 5xx with content-type: text/plain
  3. Shoot a query that hits the respective subgraph

Expected behavior Only one error is returned as it was on version 1.23.0.

nathanmarcos avatar Aug 15 '23 10:08 nathanmarcos

This relates to https://github.com/apollographql/router/issues/2687 and was intentional because we were swallowing some error messages before. What was the impact to you? Was this just something you observed and were curious about?

abernix avatar Aug 21 '23 10:08 abernix

Hello @abernix, we are constantly monitoring the application logs to ensure its health and find possible error messages that can be resolved to reduce log volume. We noticed extra messages in there and traced it down to router’s version 1.24.0. The application works fine, but unfortunately, these extra errors seem like a waste of resources since they do not add any extra value that helps us debug the issue. This reflects in extra 2x more storage usage for those logs and 2x more data transfer between router and the logs backend services and router and its clients.

Ideally, we would like to bring back the previous behavior. We could do that via string matching on a plugin, but that would be very error-prone. So I opened this issue to see if we could get it “fixed” at the root.

nathanmarcos avatar Aug 21 '23 12:08 nathanmarcos

I could get behind not specifying an error if the content type is not json/graphl and HTTP 200 is returned. We are also attaching the body of the response if this exists as a separate error.

Suggest that all errors for non-graphql responses are attached as extensions rather than separate errors.

BrynCooke avatar Aug 21 '23 14:08 BrynCooke

Hello @BrynCooke , is there any progress on this?

We keep seeing our logs flooded with useless errors.

Currently running version 1.33.2.

nathanmarcos avatar Oct 26 '23 18:10 nathanmarcos

Hello @BrynCooke and @abernix , is there any update on this one? We are currently running version 1.43.0.

nathanmarcos avatar Mar 25 '24 08:03 nathanmarcos

This issue is happening for both 4xx and 5xx errors returned by the subgraphs.

nathanmarcos avatar Mar 25 '24 08:03 nathanmarcos

We have been working on a design to generally overhaul error handling in the router. The design includes work to ensure that the errors that all errors are documented, errors are not reused inappropriately, extensions contains all the information that is required rather than expanding to multiple errors, router to subgraph error handling is covered as well as client to router.

A poc was recently worked on https://github.com/apollographql/router/pull/4788 but we are now getting agreements from relevant teams before moving forward.

I realise that this has been hanging around for a long time, but we are close to getting agreement and we don't want to change the error handling only to change it again in the near future.

BrynCooke avatar Apr 04 '24 08:04 BrynCooke

Thanks for the update @BrynCooke. I hope we can get some progress on that soon.

nathanmarcos avatar Apr 10 '24 12:04 nathanmarcos