router icon indicating copy to clipboard operation
router copied to clipboard

subgraph retry policy and circuit breaking

Open Geal opened this issue 3 years ago • 2 comments

we should be able to specify a timeout for queries to a subgraph, and different techniques to retry queries or stop sending traffic temporarily to a subgraph.

While those options are set globally for a subgraph, we should take care of maintaining a list of hosts for that subgraph with query timing and success statistics per host.

Geal avatar Jan 18 '22 17:01 Geal

There are some existing Tower layers that might be worth looking at here. They may or may not be suitable, be GraphQL-aware, particularly if there are non-retryable properties. (In those cases, perhaps we can wrap one of the Tower layers that offers functionality and provide it the knowledge it needs)

(This could also relate to the fetcher Error-handling conversation.)

abernix avatar Feb 14 '22 09:02 abernix

Follows up #1347

abernix avatar Aug 12 '22 08:08 abernix

adding a retry would fix #1956

Geal avatar Oct 26 '22 13:10 Geal

So I looked into the tower retry layer, and along with the issue of recognizing if a query is retryable (easy to decide on HTTP errors, less so for graphql ones), I am encountering architecture issues with plugins, the same I had in https://github.com/apollographql/router/pull/1889: the layer will need the underlying service to be clonable.

In that PR, I solved it by moving the query deduplication layer out of the traffic shaping plugin, and instead applying it directly at the subgraph service creation. Then plugins like the traffic shaping are applied above that, taking as argument and returning a (non clonable) BoxService. I was able in that PR to keep the deduplication configuration in the traffic shaping configuration though.

I could do the same for retries, but we will encounter issues with plugin ordering:

  • a subgraph request would go through -> timeout -> rate limit -> -> retry -> deduplication -> subgraph (don't know where circuit breaking would fit here)
  • while what we actually need is -> timeout -> deduplication -> retry -> rate limit -> circuit breaking -> subgraph

So I would like to remove the subgraph plugin part of the traffic shaping plugin and move it to the subgraph service, or better, add a more generic internal trait that can be implemented by some plugins, to add layers to the subgraph and other services, without working on a BoxService. Again, that would still use the same traffic shaping configuration section, but dispatch the code elsewhere

Geal avatar Oct 26 '22 13:10 Geal