router
router copied to clipboard
subgraph retry policy and circuit breaking
we should be able to specify a timeout for queries to a subgraph, and different techniques to retry queries or stop sending traffic temporarily to a subgraph.
While those options are set globally for a subgraph, we should take care of maintaining a list of hosts for that subgraph with query timing and success statistics per host.
There are some existing Tower layers that might be worth looking at here. They may or may not be suitable, be GraphQL-aware, particularly if there are non-retryable properties. (In those cases, perhaps we can wrap one of the Tower layers that offers functionality and provide it the knowledge it needs)
(This could also relate to the fetcher Error-handling conversation.)
Follows up #1347
adding a retry would fix #1956
So I looked into the tower retry layer, and along with the issue of recognizing if a query is retryable (easy to decide on HTTP errors, less so for graphql ones), I am encountering architecture issues with plugins, the same I had in https://github.com/apollographql/router/pull/1889: the layer will need the underlying service to be clonable.
In that PR, I solved it by moving the query deduplication layer out of the traffic shaping plugin, and instead applying it directly at the subgraph service creation. Then plugins like the traffic shaping are applied above that, taking as argument and returning a (non clonable) BoxService. I was able in that PR to keep the deduplication configuration in the traffic shaping configuration though.
I could do the same for retries, but we will encounter issues with plugin ordering:
- a subgraph request would go through
-> timeout -> rate limit -> -> retry -> deduplication -> subgraph (don't know where circuit breaking would fit here) - while what we actually need is
-> timeout -> deduplication -> retry -> rate limit -> circuit breaking -> subgraph
So I would like to remove the subgraph plugin part of the traffic shaping plugin and move it to the subgraph service, or better, add a more generic internal trait that can be implemented by some plugins, to add layers to the subgraph and other services, without working on a BoxService. Again, that would still use the same traffic shaping configuration section, but dispatch the code elsewhere