[do not merge] potential fixes for deferred response error handling

Open carodewig opened this issue 6 months ago • 2 comments

This has all unraveled after I pulled an innocuous-looking string. There are two semi-related changes in this PR that should probably be split into separate PRs. But as I'd like some early feedback on my approach to both of them, I'm raising this draft PR.

  • Issue 1: errors from deferred responses aren't propagated out of the execution stage (#2329).
  • Issue 2: the on_graphql_error selector doesn't work at the supergraph stage (notably for coprocessors, but I suspect this is also true for logging etc).

Issue 1

This arises because of the way we filter errors within split_incremental_response - error_path.starts_with(&path) never returns true (given the error_path and path values I've observed), so all errors are filtered out.

Current state:

error.path = Path([Key("topProducts", None), Flatten(None)])
path = Path([Key("topProducts", None), Index(0)])

error_path.starts_with(&path) => false

I believe the fix for this is to remove the trailing index from path, but I need to test this on more complex deferred queries. My concern is a path like [Key("top"), Index(0), Key("value"), Index(1)] is possible, and I don't know if Index(0) will be present in error.path. (f913ee942042bca11d3e75734fec93774beebbf3)
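To make the prefix mismatch concrete, here is a minimal sketch using a simplified stand-in for the path type. `Element` and `strip_trailing_index` are hypothetical names for illustration only, not the router's actual types or the code in the commit above. It assumes paths compare element by element, so a `Flatten` in the error path never equals an `Index(0)` in the deferred path.

```rust
// Hypothetical, simplified stand-in for the router's path elements.
#[derive(Debug, Clone, PartialEq)]
enum Element {
    Key(String),
    Index(usize),
    Flatten,
}

// Drop a single trailing Index element, if present (the proposed fix).
// Any other path is returned unchanged.
fn strip_trailing_index(path: &[Element]) -> &[Element] {
    match path.split_last() {
        Some((Element::Index(_), rest)) => rest,
        _ => path,
    }
}

// An error belongs to a deferred chunk if its path starts with the
// chunk's path once the trailing index has been removed.
fn error_matches(error_path: &[Element], path: &[Element]) -> bool {
    error_path.starts_with(strip_trailing_index(path))
}
```

With the values observed above, a plain `error_path.starts_with(&path)` is false, while `error_matches` is true. The nested case I'm worried about would still fail under this sketch if the error path carried `Flatten` where the deferred path has `Index(0)`, because only the last index is stripped.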

Issue 2

The on_graphql_error selector currently relies on CONTAINS_GRAPHQL_ERROR within the response context. That value isn't actually set until the telemetry layer of the supergraph stage, so it works for router coprocessors but not for supergraph coprocessors.

I'm generally concerned about using a global context value for on_graphql_error, since once issue 1 is fixed we can surface errors at any point in a deferred response. I'm not yet sure how to handle it at the router stage (where we're dealing with bytes), but at the supergraph stage I think it would be much better to make decisions via response.errors and response.incremental[*].errors rather than relying on CONTAINS_GRAPHQL_ERROR. (a6d801fc887ac1203a9353cb5e22c0c90b2bb033 and 4b7fdcdbbdbbe0f8239f705231f72fb6ea969787)
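As a rough sketch of what I mean by deciding from the response itself, assuming simplified stand-in types: `Response`, `IncrementalChunk`, `GraphqlError`, and `contains_graphql_error` below are hypothetical and only mirror the `errors` and `incremental[*].errors` fields mentioned above, not the router's real structs.

```rust
// Hypothetical, simplified stand-ins for the response types.
#[derive(Debug, Default)]
struct GraphqlError {
    message: String,
}

#[derive(Debug, Default)]
struct IncrementalChunk {
    errors: Vec<GraphqlError>,
}

#[derive(Debug, Default)]
struct Response {
    errors: Vec<GraphqlError>,
    incremental: Vec<IncrementalChunk>,
}

// True if this particular response (primary or deferred) carries any
// GraphQL error, either at the top level or inside an incremental chunk.
fn contains_graphql_error(response: &Response) -> bool {
    !response.errors.is_empty()
        || response
            .incremental
            .iter()
            .any(|chunk| !chunk.errors.is_empty())
}
```

The point of the sketch is that the check is evaluated per response, so an error that only shows up in a later deferred chunk is still seen, without depending on when a context flag gets set.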

I don't like my tentative solution of always returning true for on_graphql_error in on_response and relying on callers to know to use on_event_response, but I don't have a better idea at the moment.

Bonus Issues / TBD

  • If we make the changes I think we should make for issue 2, we now have different meanings for the on_graphql_error selector at the router and supergraph stages. This will need to either be fixed or be thoroughly documented.
  • Telemetry probably doesn't use on_event_response, so it will need to be updated; otherwise everything will be logged whenever on_graphql_error: true is configured.

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • [ ] Changes are compatible[^1]
  • [ ] Documentation[^2] completed
  • [ ] Performance impact assessed and acceptable
  • Tests added and passing[^3]
    • [ ] Unit Tests
    • [ ] Integration Tests
    • [ ] Manual Tests

Exceptions

Note any exceptions here

Notes

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

carodewig avatar May 09 '25 20:05 carodewig