rover supergraph compose should not make a separate graphql query for each subgraph

Description

Rover supergraph compose loops and makes a graphql query SubgraphFetchQuery for each subgraph in the supergraph. For supergraphs that have significant number of subgraphs (we have 294 in QA, 277 in PROD), this can take quite a while and is not an efficient mechanism to retrieve the information needed -- and its a waste of the client's time and Apollo API server's resources.

Instead provide and use a query that takes an array input list of graphref & subgraph to retrieve, or retrieve all subgraphs for each graphref mentioned and reuse it. I am unsure in the value of using rayon threads to do this.

Steps to reproduce

Code is here: https://github.com/apollographql/rover/blob/436a677dacec77705a376eb9f8dc482d949704dc/src/command/supergraph/resolve_config.rs#L68-L183

Expected result

For composing supergraphs that have approaching 300 subgraphs, this should take < 10 seconds, no matter how many cpu's a machine has.

Actual result

On our build runners that runs this, they have a single CPU available, therefore this takes almost 4 minutes to run. Additionally our pipelines that invoke supergraph compose fail often, some days more than 1/8 of attempts fail.

Depending on the number of CPUs available on the machine, the time that it takes to perform this step:

RAYON_NUM_THREADS	           time
-----------------    ------------------
 1	                 3:52.58  total
 2	                 2:05.23  total
 4	                 1:12.17  total
 8	                   48.700 total
10	                   43.377 total
16	                   35.506 total
24	                   31.854 total

Environment

Ubuntu Rover 0.23.0 bash Studio plan: enterprise installation method used by our pipelines: npx, but this has been reproduced locally on my mac laptop just using the rover binary.

Jun 27 '24 04:06 theJC

This is the query that is executed for each and every subgraph, in that parallel iterator:

query SubgraphFetchQuery($graph_ref: ID!, $subgraph_name: ID!) {
  variant(ref: $graph_ref) {
    __typename
    ... on GraphVariant {
      subgraph(name: $subgraph_name) {
        url,
        activePartialSchema {
          sdl
        }
      }
      subgraphs {
        name
      }
    }
  }
}

Jun 27 '24 04:06 theJC

Some additional context where these concerns were highlighted and discussed over two years ago but unfortunately the fundamental issue was not truly addressed in https://github.com/apollographql/rover/issues/992 and just masked (temporarily for us) by using the rayon threads.

so, we need to query for all of those. however we could definitely go through and find all of the subgraphs we need and then batch those requests so that one request is performed per supergraph variant. thanks for the report on this, we'll want to get this fixed up for sure.

https://github.com/apollographql/rover/issues/992#issuecomment-1036503371

Jun 27 '24 04:06 theJC

Hey @theJC, thanks for raising this issue. We've been taking a look at this problem recently and think we have a couple solutions coming up that should address this problem.

Firstly, we are looking at replacing the rayon implementation with a tokio-based implementation. In testing, we've been able to reduce a sample graph of 100 subgraphs from 1m15s down to 10-11s by leveraging a better concurrency implementation.

We are also looking at a more efficient strategy for fetching subgraphs as part of supergraph compose and rover dev, as you've mentioned above. We are initially targeting the ability to use a --graph-ref flag with these commands, which would fetch all of the graphs from studio and allow you to override the SDLs that we pulled down with the values in your local supergraph.yaml file.

We will need to do a bit of testing on the last one, just to make sure we're making the best use of our resources across the API and client.

We expect to release the improvements around this in the next month.

Jul 15 '24 16:07 dotdat

howdy, @theJC! #1995 is the branch removing rayon threads in favor of tokio if you'd like to keep tabs on it, and we're getting ready to release the --graph-ref work @dotdat referenced above (soft date is this upcoming Monday if all goes well). We'll have a release candidate with the asyncification of rover in a short while, if you're interested in testing it out to see whether it helps? We'll also be considering other changes like @dotdat mentioned, but this is our first step toward making things a bit more performant when running supergraph and dev

Jul 19 '24 18:07 aaronArinder

#2001 adds batching to fetching subgraphs

Aug 05 '24 17:08 aaronArinder

^ part of what will be the release including #1995; no hard dates yet, but getting closer

Aug 05 '24 17:08 aaronArinder