federation icon indicating copy to clipboard operation
federation copied to clipboard

Apollo Gateway Query Plan incorrectly handles union/fragment of sharable Interfaces

Open getglad opened this issue 2 years ago • 10 comments

I am running into a problem when running a query of a shared interface that is also a field on Query.

Essentially, it appears that something is causing confusion in the query plan where the query plan is defaulting to a particular subgraph. If that subgraph does not have the implementation of the type, it drops the union/fragment.

We have built a plugin with logic that can analyze the id and point the query to the correct subgraph, and can get a return on the gateway, but when the gateway returns the the client, the result is null. I assume this is because the gateway is attempting to return a type belonging to the "default" subgraph instead of the correct subgraph and something is failing.

I have seen some recent PRs that impact the query plan - was not sure if this may be related.

  • https://github.com/apollographql/federation/pull/1911
  • https://github.com/apollographql/federation/pull/1859

Also entirely possible that I've misunderstood how to decorate these types/fields, so could be human error.

Happy to provide more details, but didn't want to overwhelm with a wall of output.

Reproducable example

Subgraph 1

extend schema
  @link(url: "https://specs.apollo.dev/federation/v2.0",
        import: ["@key", "@shareable"])

interface Node {
  id: ID!
}

type Query {
  node(id: ID!): Node @shareable
}

Subgraph 2

extend schema
  @link(url: "https://specs.apollo.dev/federation/v2.0",
        import: ["@key", "@shareable"])

interface Node {
  id: ID!
}

type Foobar implements Node {
  id: ID!

  uuid: String!
  version: Int!
  name: String!
  description: String
  category: [String!]
}

type Query {
  node(id: ID!): Node @shareable
}

Supergraph (partial/relevant bits - I can share more if helpful)

enum join__Graph {
  SUBGRAPH1 @join__graph(name: "subgraph1", url: "https://path.com")
  SUBGRAPH2 @join__graph(name: "subgraph2", url: "https://otherpath.com")
}

interface Node
  @join__type(graph: SUBGRAPH1)
  @join__type(graph: SUBGRAPH2)
{
  id: ID!
}

type Foobar implements Node
  @join__implements(graph: SUBGRAPH2, interface: "Node")
  @join__type(graph: SUBGRAPH2)
{
  id: ID!
  uuid: String!
  version: Int!
  name: String!
  description: String
  category: [String!]
}

type Query
  @join__type(graph: SUBGRAPH1)
  @join__type(graph: SUBGRAPH2)
{
  node(id: ID!): Node
}

Query (from client)

query SampleQuery($id: ID!) {
  node(id: $id) {
    __typename
    id
    ... on Foobar {
      uuid
    }
  }
}

QueryPlan (should be subgraph2)

QueryPlan {
  Fetch(service: "subgraph1") {
    {
      node(id: $id) {
        __typename
        id
      }
    }
  },
}

getglad avatar Jun 16 '22 18:06 getglad

Thanks a lot for this issue, we face the same at @productboard I'd be more than happy to take a look at it with someone – explain it maybe even better, because it's complex problem in nature.

We run @apollo/gateway": "0.44.1" in production with no issues with the following architecture.

First of all, this problem is related to the fact we use Node resolver (see the spec) in our infrastructure, and it might be related to implementation (we actually already touched on this in this repository, see https://github.com/apollographql/federation/issues/1067#issuecomment-943276650)

For brevity bit simplified example.

Context

  1. We have "noderesolver" service

interface Node {
  id: ID!
}
type Query {
  node(id: ID!): Node
  nodes(ids: [ID!]!): [Node]!
}

To grasp what it does, see following test

fun `Should handle request`() { 
       val query =
            """{node(id:"MTpOb3RlOjE="){
                __typename 
                ...on Note{__typename id}
                ...on Feature{__typename id}}
                }"""
        webTestClient.post()
            .uri("/graphql")
            .bodyValue(mapOf("query" to query))
            .exchange()
            .expectStatus().isOk()
            .expectBody().jsonContent {
                isEqualTo(
                    """
                {"data":{"node":{"__typename":"Note","id":"MTpOb3RlOjE="}}}
                """
                )
	}
}

The idea here is that every entity is implementing this Node interface and GraphQL Gateway can then based on extracted __typename understand who to ask for it.

  1. To follow-up on the example, we have also "demo-feature" service with following schema
type DemoFeature implements Node @key(fields: "id") {
    id: ID!
    name: String!
    description: String
}

It's quite idiomatic DGS https://netflix.github.io/dgs/ so I won't be sharing any details here.

Example

apollo/gateway version 0.44.1 – correct behaviour:

Input:

query X($nodeId: ID!) {
  node(id: $nodeId) {
    id
    __typename
    ... on DemoFeature {
      name
    }
  }
}

Output:

{
  "data": {
    "node": {
      "id": "MTpEZW1vRmVhdHVyZTo5ZTg1ZGUxMC0xMjY5LTRlNmUtYTBiOC1iN2VhMTQ1YzRkZDc=",
      "__typename": "DemoFeature",
      "name": "smoke-test-feature"
    }
  }
}

Query plan:

QueryPlan {
  Sequence {
    Fetch(service: "noderesolver") {
      {
        node(id: $nodeId) {
          __typename
          ... on DemoFeature {
            __typename
            id
          }
      }
    },
    Parallel {
      Flatten(path: "node") {
        Fetch(service: "demo-features") {
          {
            ... on DemoFeature {
              __typename
              id
            }
          } =>
          {
            ... on DemoFeature {
              id
              name
            }
          }
        },
      },
    },
  },
}

Incorrect behavior with apolo/gateway 2.0.5

Federation 1

Input 1:

query DemoFeatureNode($nodeId: ID!) {
  node(id: $nodeId) {
    id
    __typename 
		... on DemoFeature {
      name
    }
  }
}

Input 2:

query DemoFeatureNode($nodeId: ID!) {
  node(id: $nodeId) {
    id
    __typename 
  }
}

Both returns

{
  "data": {
    "node": null
  }
} 

with query plan:

QueryPlan {
  Fetch(service: "node-resolver") {
    {
      node(id: $nodeId) {
        __typename
      }
    }
  },
}

Federation 2

We tried to upgrade to Federation 2 just for a case it would solve our problem

Input 1:

query DemoFeatureNode($nodeId: ID!) {
  node(id: $nodeId) {
    id
    __typename 
  }
}
{
  "data": {
    "node": {
      "id": "MTpEZW1vRmVhdHVyZTo1MzA2ZjNlMi0wMGFjLTQ0NWItODVjNS1iYTI5MjMwODA3MDE=",
      "__typename": "DemoFeature"
    }
  }
}

However query plan is:

QueryPlan {
  Fetch(service: "node-resolver") {
    {
      node(id: $nodeId) {
        __typename
        id
      }
    }
  },
}

if we then expand on it to make it more real use case with this input:

query DemoFeatureNode($nodeId: ID!) {
  node(id: $nodeId) {
    id
    __typename 
    ... on DemoFeature {
      id
      name
    }
  }
}

we get this output:

{
  "data": {
    "node": null
  }
}

with the same query plan.


I hope those details really help to track the problem down, this effectively blocks us from migration towards Federation 2.

@getglad do I get right we have basically the same problem?

jukben avatar Jun 20 '22 06:06 jukben

@getglad I think the confusion comes from how @shareable should be used. And to be fair, it is easy to get wrong and I had a patch to help catch that kind of errors at #1556, but if falled through the cracks (I'll try to rebase/ressurect it though).

The important point is the 2nd point of the Important considerations for @shareable of the documentation. Namely that if a field is shared, then all subgraph should resolve it the exact same way (meaning, in that case, resolve the same exact entity).

So your example kind of cannot work, because really, Subgraph 1 have no good way to even resolve its node(id:) field in the first place, since the interface Node has no implementation in that subgraph (or more precisely, the only possible resolution would be to always return null).

But the point is that the query planner does rely on that knowledge that both subgraphs must resolve the field the same, or in other words, that using either Subgraph 1 or Subgraph 2 should not matter for correctness, it only matters for performance. So the fact the QP chooses Subgraph 1 is somewhat random: it assumes that it's as good as choosing Subgraph 2 (but infer in particular that node will never return a Foobar since Subgraph 1 cannot, and that's why it doesn't worry further).

And in general, a takeaway here can be that @shareable for entity fields is ever only a performance concern, but it doesn't create new modeing opportunity that you wouldn't have otherwise (to clarify, @shareable on value types is more generally useful/necessary, I'm talking specifically of entity fields).

As said above, the original intent was to reject such example (#1556) as I agree that it's easy to get confused by this.

I also reckon that your intent was probably different from what @shareable permits. I guess I'm not 100% sure what you expected @shareable to do in that example. What's the point of Subgraph 1 for instance, if Subgraph 2 can be queried directly? I assume you have more subgraphs and maybe wanted Subgraph 1 to be able to dispatch to other subgraphs somehow? If so, there is ways to do this with @key (but without @shareable), though it's not perfect as Subgraph 1 would still need to know all the possible implementations of Node.

pcmanus avatar Jun 21 '22 08:06 pcmanus

@jukben I'm not completely sure this is the same issue. Afaict, you don't seem to be using @shareable in particular.

But I understand that you have something that worked in federation 1 and doesn't in federation 2, and that does sound like a potential bug. I could use a bit more detail however to be honest. In particular, I assume your "noderesolver" service schema has more than what shown? It must have a declaration for type DemoFeature in particular, otherwise the query plan you shown for gateway 0.44.1 would have sent an invalid query to "noderesolver".

I'm happy to help track the issue down, but it migth be useful to open a separate issue so we focus on your case, and I think I could use a bit more details (disclaimer: I'm not really familiar with DGS in particular; I can try to become familiar but I don't know when I'll have time to do that. A repro example without DGS would certainly help speed things up on my side in particular if that is something you can provide).

pcmanus avatar Jun 21 '22 08:06 pcmanus

@pcmanus Thanks a lot for your answer! ❤️ I'm afraid this has nothing to do with Federation 2 in general, first step we wanted to do is to upgrade to GraphQL Gateway to the latest version (still using Federation 1) – and that's when things go south. We just tried the lucky shot to turn on Federation 2 composition to see if there's any difference.

This is interesting though because "noderesolver" declares only the schema I send (I mean this is the schema in Schema Registry). What we do though – is compose schema in a runtime of the service to not fail on runtime validation, because this is what Node Resolver will receive

{node(id:"dHlwZUZyb21BU2VydmljZTox"){__typename ...on DemoFeature{id __typename}}}

So we artificially build (on the fly, because node resolver knows nothing about DemoFeature..) that DemoFeature type to be able to resolve it (basically return it as __typename see the example above).

I'm not sure if it's the same issue or not – because we try to do the same if I understand correctly and it's indeed possible in an older version of Gateway. Happy to open a new issue, if you think so. Let me know, please 🙏. Maybe there's a better implementation of Node resolver (Relay Object ID specification, I would expect this is quite common) I'm not aware of? If you can find some time, we can hop on a call and I can show it to you around. The last resort will be repro – but given the complexity, it's quite time-consuming. Maybe there's a better way how simulate this via some tests? 🤷

To my understanding, it has nothing to do with the actual implementation of the subgraph because it never reaches it. What I wanted to say is that we have correctly implemented __resolveReference etc (https://www.apollographql.com/docs/federation/entities/)

jukben avatar Jun 21 '22 09:06 jukben

@jukben gotcha, thanks for the details, that make more sense now :)

First, I'll quickly call out that federation has not particular understanding of the Relay Object ID specification. Instead, federation solely identify entities based on the @key directive. This doesn't mean you cannot build the Relay spec on top of it (within the limitation of federation), but just wanted to clarify that doing so is completely transparent to federation.

Now, let me try to rephrase my understanding of what you're trying to achieve (and please correct any incorrect assumptions): you want "noderesolver" to act as a form of dispatcher event, that given an id essentially give you the corresponding type (the __typename), and to have the query continue to the appropriate subgraph based on that type. Am I butchering this?

From that, I'll note that the "noderesolver" service as you wrote it (with an interface but no implementation) plays fast and loose with the graphQL spec, in the sense that resolving the field node should really return an implementation of Node, but there is no such thing in that service. Not to mention that the service should be able to receive queries with types in it that don't exists.

I understand that you have some "on the fly" logic to make this somehow avoid (normal graphQL) runtime validation, and I don't know the detail of that, but it seems a bit hacky. Fair?

Anyway, the point I'm getting at is that federation does not allow this kind of dynamic "cheating" of the graphQL type system at the moment. In other words, if "noderesolver" dispatches based on the concrete type associated to each ID, then it needs to know the possible concrete types: they need to exist in the schema.

That means that the "noderesolver" service would have to look something like:

interface Node {
  id: ID!
}

type Query {
  node(id: ID!): Node
  nodes(ids: [ID!]!): [Node]!
}

type DemoFeature implements Node {
  id: ID!
}

type Feature implements Node {
  id: ID!
}

// .... same for all implementation types ...

The downside obviously is that you need all those "shim" implementations in the schema. Fwiw, if I had to do that, I'd probably consider writting some super simple script that adds all those shims automatically based on other subgraphs (still, this would have to be called statically pre-composition). I'll note that you probably don't need to implement resolver for any of those.

On the plus side, it does mean that your service is not proper graphQL and you have to work-around runtime validation when you node(X) resolver in "noderesolver" returns "__typename: DemoFeature".

Anyway, I don't know if this helps. Hopefully I didn't completely missed what you're trying to do, but let me know if that's the case.

pcmanus avatar Jun 21 '22 13:06 pcmanus

@pcmanus - thank you for the thorough answer! I do think @jukben and I are working towards the same problem. Your clarifications are helpful. A few follow ups:

First off when you point out:

Subgraph 1 have no good way to even resolve its node(id:) field in the first place, since the interface Node has no implementation in that subgraph ... What's the point of Subgraph 1 ...

You are correct that my example did not have much of anything in Subgraph1. In the real world, we have several implementations of Node in both Subgraph1 and Subgraph2. I was trying to emphasize I was confused by the result of query plan, not indicate behaviors on the subgraph. I was simplifying and maybe went to far :) Sorry for the confusion.

My expectation was that we could use Federation to dispatch to the correct subgraph given the content of the query.

To reiterate, the value of id can also be used to identify which subgraph to direct the query to. We are able to point to the correct subgraph using a small plugin to analyze id and RemoteGraphQLDataSource to set the correct value on request.http.url. We can use experimental_didResolveQueryPlan to see the correct subgraph is returning as expected.

~I am also assuming (and your comments make me think you agree) there is something going on in the final return that is treating the return I see in experimental_didResolveQueryPlan as coming from Subgraph1 which results in null, but I would imagine that problem goes away if Query Plan finds it is a Subgraph2 query instead.~ (Edit: see comment below that this must not be the case, but why changing query plan would be simplify the DX)

You commented

So the fact the QP chooses Subgraph 1 is somewhat random: it assumes that it's as good as choosing Subgraph 2 (but infer in particular that node will never return a Foobar since Subgraph 1 cannot, and that's why it doesn't worry further).

I think this somewhat confirms my suspicion about query planner internals, but I am surprised.

I would propose that if QP can infer that a subgraph cannot return a type and therefore then removes it from the plan, that instead of being random it chooses a subgraph that could return the type.

If it is possible to point me to the general area where QP does this inference I'd be happy to poke around and see if I could make a functional example.

getglad avatar Jun 21 '22 14:06 getglad

So as a temporary work around, I have been able to attach some values in a requestContext.request from a plugin's requestdidstart hook to the requestContext.context property.

When there is a node query, I can then reference those values in the context of the willSendRequest hook of RemoteGraphQLDataSource to alter the request object. The return passes back to the client as expected.

This is obviously a hack, and assumes that id can function as a pointer to the correct subgraph, but it gets query plan out of the way when proxying a node query to a subgraph, and there doesn't appear to be any blockers interfering with the return to the client.

I would prefer that something could be done with query plan to point to a subgraph where this could resolve.

requestdidstart

https://www.apollographql.com/docs/apollo-server/integrations/plugins-event-reference/#requestdidstart

We have more going on in our plugin but to reduce the important part...

<if we identify that there is a node query...>
requestContext.context["proxyTarget"] = <some value we derive from the `id` variable>
requestContext.context["origQuery"] = requestContext.request.query`
requestContext.context["origOperationName"] = requestContext.document.definitions[0].name.value

willSendRequest

https://www.apollographql.com/docs/federation/api/apollo-gateway/#class-remotegraphqldatasource

<if we identify that there is a node query...>
request.http.url = context["proxyTarget"]
request.query = context["origQuery"]
request.operationName = context["origOperationName"]

getglad avatar Jun 21 '22 17:06 getglad

@pcmanus Sweet! Thanks a lot I think we are super close!

Now, let me try to rephrase my understanding of what you're trying to achieve (and please correct any incorrect assumptions): you want "noderesolver" to act as a form of dispatcher event, that given an id essentially give you the corresponding type (the __typename), and to have the query continue to the appropriate subgraph based on that type.

Yes, exactly!

I understand that you have some "on the fly" logic to make this somehow avoid (normal graphQL) runtime validation, and I don't know the detail of that, but it seems a bit hacky. Fair?

Basically, we build schema in runtime which might look like this:

interface Node {
  id: ID!
}

type Query {
  node(id: ID!): Node
  nodes(ids: [ID!]!): [Node]!
}

# The part above is uploaded to Schema Registry (using Rover). Following part is attached on the fly – based on decoded base64 ID to be able to return correct `__typename` without additional hacks.

type DemoFeature implements Node {
  id: ID!
}

Speaking about Schema Registry `DemoFeature is defined in different subgraph (using @key):

type DemoFeature implements Node @key(fields: "id") {
    id: ID!
    name: String!
    description: String
}

Where I'm bit lost is this part:

Anyway, the point I'm getting at is that federation does not allow this kind of dynamic "cheating" of the graphQL type system at the moment.

About which schema do you talk about? Do you suggest that the schema we build in runtime has to be actually uploaded to Schema Registry for the particular "noderesolver" subgraph as well? Because I thought that's the idea of supergraph which is composed as you would expected:

schema @core(feature: "https://specs.apollo.dev/core/v0.1") @core(feature: "https://specs.apollo.dev/join/v0.1") @core(feature: "https://specs.apollo.dev/tag/v0.1") @apollo_studio_metadata(buildId: "4edb8314-f8dd-4025-8b3c-1585a05f4af0", checkId: null, launchId: "4edb8314-f8dd-4025-8b3c-1585a05f4af0") {
  query: Query
  mutation: Mutation
}

directive @core(feature: String!) repeatable on SCHEMA

directive @join__field(graph: join__Graph, requires: join__FieldSet, provides: join__FieldSet) on FIELD_DEFINITION

directive @join__type(graph: join__Graph!, key: join__FieldSet) repeatable on OBJECT | INTERFACE

directive @join__owner(graph: join__Graph!) on OBJECT | INTERFACE

directive @join__graph(name: String!, url: String!) on ENUM_VALUE

directive @tag(name: String!) repeatable on FIELD_DEFINITION | INTERFACE | OBJECT | UNION

interface Node {
  id: ID!
}

type DemoFeature implements Node @join__owner(graph: DEMO_FEATURES) @join__type(graph: DEMO_FEATURES, key: "id") @join__type(graph: DEMO_NOTES, key: "id") {
  id: ID! @join__field(graph: DEMO_FEATURES)
  description: String @join__field(graph: DEMO_FEATURES)
  name: String! @join__field(graph: DEMO_FEATURES)
}

type Query {
  node(id: ID!): Node @join__field(graph: NODERESOLVER)
  nodes(ids: [ID!]!): [Node]! @join__field(graph: NODERESOLVER)
}

directive @apollo_studio_metadata(buildId: String, checkId: String, launchId: String) on SCHEMA

scalar join__FieldSet

enum join__Graph {
  DEMO_FEATURES @join__graph(name: "demo-features", url: "http://graphql-demo-features-service.graphql/graphql")
  NODERESOLVER @join__graph(name: "noderesolver", url: "http://graphql-node-resolver.graphql:80/graphql")
}

Do I try to use something unsupported? Have we somehow unconsciously exploited some "bug" that has been fixed in Apollo Gateway v2? Because that behavior we have present in 0.44 version makes sense to me. 🤔

jukben avatar Jun 21 '22 18:06 jukben

@pcmanus surfacing this 🙏 Is there anything else I can do?

jukben avatar Jul 26 '22 11:07 jukben

this issue has been quiet for quite a while. i don't see anything about dynamic dispatch on the docs website. is this still planned for an upcoming release or is it postponed/ blocked by another issue. a status update would be nice.

JipSterk avatar Jun 05 '23 10:06 JipSterk