
Interface fragment expansion causes conflicting field types

Open MarcPorciuncula opened this issue 2 years ago • 5 comments

I have an interface with a nullable field. One implementer marks that field nullable and another implementer marks it non-nullable. (I've changed the names of the types from the originals, but the concept is the same.)

interface Post {
  likes: Int
}

type Tweet implements Post {
  likes: Int!
}

type Retweet implements Post {
  # A quote retweet can have likes but a normal retweet does not
  likes: Int
}

type Query {
  post(id: ID!): Post
}

When querying against the interface, the field is nullable regardless of the returned type.

query GetPost($id: ID!) {
  post(id: $id) {
    likes
  }
}

It seems that when receiving a query like this, the gateway will use information about the implementers of the interface to expand the fields under post into fragments like this:

post(id: $id) {
  __typename
  ... on Tweet {
    likes
  }
  ... on Retweet {
    likes
  }
}

In this query the type of post.likes could be Int or Int!. This results in a GraphQL validation error because there are two different types trying to be bound to the same field.

Fields "likes" conflict because they return conflicting types "Int!" and "Int". Use different aliases on the fields to fetch both if this was intentional.
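For readers who want to see why the validator rejects the expanded query, here is a minimal sketch of the comparison behind that error, written in TypeScript. It is a simplified take on the spec's "same response shape" idea for leaf types only: types are passed as SDL strings, and the function name `sameResponseShape` is chosen for illustration and is not an API of graphql-js or the gateway.

```typescript
// Simplified sketch: two fields sharing a response name must agree on every
// non-null and list wrapper, and on the named leaf type underneath.
// Assumption: types are SDL strings such as "Int", "Int!" or "[String!]!".
// Composite (object) return types are out of scope for this sketch.
function sameResponseShape(a: string, b: string): boolean {
  let x = a.trim();
  let y = b.trim();
  for (;;) {
    const xNonNull = x.charAt(x.length - 1) === "!";
    const yNonNull = y.charAt(y.length - 1) === "!";
    // Non-null wrappers must match at every nesting level.
    if (xNonNull !== yNonNull) return false;
    if (xNonNull) {
      x = x.slice(0, -1);
      y = y.slice(0, -1);
    }
    const xList = x.charAt(0) === "[" && x.charAt(x.length - 1) === "]";
    const yList = y.charAt(0) === "[" && y.charAt(y.length - 1) === "]";
    // List wrappers must match as well.
    if (xList !== yList) return false;
    // Once all wrappers are peeled off, the named types must be identical.
    if (!xList) return x === y;
    x = x.slice(1, -1).trim();
    y = y.slice(1, -1).trim();
  }
}

// Tweet.likes is "Int!" and Retweet.likes is "Int", so the shapes differ.
const likesConflict = !sameResponseShape("Int!", "Int");
```

With the types from the example, `likesConflict` is `true`, which corresponds to the conflict reported for `likes`.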

I believe the way gateway expands interface fields into inline fragments is incorrect or problematic as it causes problems like these. If I ran the source query directly on the subgraph it would have no problem but the changes the gateway makes result in an invalid query.

Potential solutions would be to not expand interface fields into inline fragments, or to give the common field in the fragments different aliases (as suggested by the error message), then rename them to the original name before returning the response.
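To make the alias idea concrete, here is a rough TypeScript sketch of how the expanded selection could be generated with one alias per runtime type. The alias scheme `<field>__<TypeName>` and the helper names are assumptions made up for this illustration; nothing here reflects how the gateway actually builds subgraph queries.

```typescript
// Hypothetical alias scheme: "<field>__<TypeName>". A real implementation
// would also have to make sure this never collides with user-chosen aliases.
function aliasFor(field: string, typeName: string): string {
  return field + "__" + typeName;
}

// Builds the body of the expanded selection, one inline fragment per type,
// with the shared field aliased so that the response names no longer clash.
function expandWithAliases(field: string, typeNames: string[]): string {
  const lines: string[] = ["__typename"];
  for (const typeName of typeNames) {
    lines.push(
      "... on " + typeName + " { " + aliasFor(field, typeName) + ": " + field + " }"
    );
  }
  return lines.join("\n");
}

const selection = expandWithAliases("likes", ["Tweet", "Retweet"]);
// selection is:
// __typename
// ... on Tweet { likes__Tweet: likes }
// ... on Retweet { likes__Retweet: likes }
```

Because each fragment now uses a distinct response name, the `Int!` and `Int` versions of `likes` are no longer merged by validation. The aliased names would still need to be mapped back to `likes` in the response.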


Here are the packages I used

Package              Used in     Version
@apollo/gateway      gateway     0.43.1
apollo-server        gateway     3.5.0
@apollo/subgraph     subgraph    0.1.4
apollo-server-koa    subgraph    3.4.0
graphql              subgraph    15.7.2

MarcPorciuncula avatar Dec 01 '21 05:12 MarcPorciuncula

Thanks for the report. And you are obviously correct, this is an issue.

Unfortunately, for the current 0.x versions of federation, I don't think there is an easy fix:

  • expanding interfaces into their runtime types is fairly ingrained in how the query planner works (and is needed in many cases), and changing it is made harder by the fact that the intermediate representation generated by composition (what we call the supergraph) lacks, in those current versions, enough information about interfaces to do so.
  • using aliases and renaming afterwards is also involved given the implementation (mostly the renaming afterwards). In particular, it would require changing the format used for query plans.

And so, because the 0.x line is now essentially in "maintenance" mode, I think this will remain a limitation of those 0.x versions (barring someone coming up with an alternative fix that is simple enough to be confidently put in a maintenance release). Essentially, you should ensure that interface implementations use the exact same types as the interface to avoid running into this problem.

The brighter news is that the upcoming major version of federation (version 2, currently in alpha) already does better here. For instance, in the example of the description, querying the interface field would not expand into fragments internally and you won't run into this issue.

The slightly less good news is that this doesn't mean the problem is fully fixed in federation 2 (yet), because there are cases where expanding the interface is kind of necessary. That's because in federation, you could have one implementation where a field is resolved "locally" but another implementation where that same field is external and resolved by another subgraph. This implies internally making different kinds of queries depending on the concrete implementation (that's why current federation always does this, btw).

Anyway, I believe that means you can still run into this problem in federation 2, though an example of that is a tad more involved (and so hopefully you are less likely to run into it in practice). Thankfully I think this can be reasonably fixed (by making the implementation a bit more judicious regarding what is expanded and what isn't), but I need to take some time to double check. So stay tuned.

pcmanus avatar Dec 01 '21 11:12 pcmanus

Thankfully I think this can be reasonably fixed (by making the implementation a bit more judicious regarding what is expanded and what isn't), but I need to take some time to double check.

Well, I double-checked, and that idea of mine unfortunately doesn't work.

So using aliases is the only idea that I can see working properly at the moment. And as mentioned above, while doing so is relatively simple conceptually, I suspect the implementation is somewhat involved: the gateway will need a way to know when aliases are used in this manner so it can merge the responses back. Either we use a special naming scheme for those aliases, in which case we'd have to make sure it doesn't conflict with user aliases, or we add something new to query plans, and it's not obvious what that would look like.

Anyway, I really want this fixed eventually, but to set expectations, this may take a little while to get prioritized given current priorities and the facts that:

  1. it's not an easy fix
  2. it's less often an issue with federation 2 (again, the example of the description works just fine in federation 2; where this doesn't work is when some of the implementations involve an @external field or a @requires)
  3. there is a fairly simple workaround: you can ensure all implementations use the same type as the interface. Granted, it loses precision, but I'd suspect it's rarely a blocker per se.

But in the meantime, what I would suggest is to actually detect the cases that are problematic early and to throw a meaningful error, instead of failing at runtime and letting users figure it out. Adding such validation is comparatively fairly simple and I've pushed a PR for this at #1318.
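As an illustration of the kind of early check meant here, the following TypeScript sketch flags implementations whose field type differs from the interface's. It is not the code from #1318: the `InterfaceDef` and `ObjectDef` shapes and the function `findInterfaceTypeMismatches` are hypothetical, and types are compared as plain SDL strings.

```typescript
// Hypothetical, simplified schema model: field name -> SDL type string.
type FieldTypeMap = { [fieldName: string]: string };

interface InterfaceDef {
  name: string;
  fields: FieldTypeMap;
}

interface ObjectDef {
  name: string;
  implementsInterfaces: string[];
  fields: FieldTypeMap;
}

// Reports every implementation field whose type is not exactly the
// interface's type, since those are the ones that can clash after expansion.
function findInterfaceTypeMismatches(
  iface: InterfaceDef,
  objects: ObjectDef[]
): string[] {
  const errors: string[] = [];
  for (const obj of objects) {
    if (obj.implementsInterfaces.indexOf(iface.name) === -1) continue;
    for (const fieldName of Object.keys(iface.fields)) {
      const expected = iface.fields[fieldName];
      const actual = obj.fields[fieldName];
      if (actual !== undefined && actual !== expected) {
        errors.push(
          obj.name + "." + fieldName + " has type " + actual +
            " but " + iface.name + "." + fieldName + " has type " + expected
        );
      }
    }
  }
  return errors;
}

const post: InterfaceDef = { name: "Post", fields: { likes: "Int" } };
const mismatches = findInterfaceTypeMismatches(post, [
  { name: "Tweet", implementsInterfaces: ["Post"], fields: { likes: "Int!" } },
  { name: "Retweet", implementsInterfaces: ["Post"], fields: { likes: "Int" } },
]);
// mismatches contains a single entry, for Tweet.likes.
```

This check is deliberately stricter than GraphQL itself, which allows `Int!` to implement `Int`. It matches the workaround of keeping implementation types identical to the interface type.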

pcmanus avatar Dec 16 '21 13:12 pcmanus

Hey @pcmanus, thanks for investigating, I really appreciate the effort. Just wanted to give an update on my side for you or anyone else experiencing the same problem.

As a workaround we looked into adjusting the type implementing the interface with the non-nullable field so that the field is nullable, at the cost of having to find and update all consumers of that type. After investigating all consumers we were fortunate enough in our particular case to be able to remove the consuming code completely.

I think the biggest risk (and certainly the biggest impact for my company when we experienced it) is the fact that this problem can arise without warning when porting to federation. So I think adding a meaningful error message is a great start. Maybe even a note or warning in the federation docs could be helpful too.

MarcPorciuncula avatar Dec 20 '21 04:12 MarcPorciuncula

For the record, I'd like to note that the PR I pushed (#1318) does not handle all cases, and that detecting all cases is probably quite involved.

The reason is that the patch essentially checks, for each interface field, whether we may need to "type-explode" for that field, which federation 2 avoids unless there is an @external involved.

But sometimes we may type-explode for a field whose type is the same everywhere, while fetching that field requires other fields (keys for one, but @requires in other subgraphs can be another), and some of the versions of those fields may have type mismatches, leading to invalid queries.

And that's pretty hard to validate, because it cannot be checked on a single subgraph: it depends on which @key and @requires are defined everywhere.

Additionally, this can happen with unions, not just interfaces. Consider the following subgraph:

type Query {
  u: U
}

union U = A | B

type A @key(fields: "id") {
  id: Int
  f: Int @external
  g: Int
}

type B @key(fields: "id") {
  id: Int
  f: Int @external
  g: String
}

And consider we do the following query:

{
  u {
    ... on A {
      f
    }
    ... on B {
      f
    }
  }
}

This looks innocuous because f is of the same type in both A and B. But if the other subgraph is:

type A @key(fields: "id") {
  id: Int
  f: Int @requires(fields: "g")
  g: Int @external
}

type B @key(fields: "id") {
  id: Int
  f: Int @requires(fields: "g")
  g: String @external
}

Then the query to the first subgraph would actually include g for both types, because in both cases it is required by f. In other words, the first sub-query generated is:

{
  u {
    ... on A {
      __typename
      id
      g
    }
    ... on B {
      __typename
      id
      g
    }
  }
}

but that is invalid due to g's types.

And it's quite hard to detect that this example should be rejected, because you cannot infer it from either subgraph individually.
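To show why the check needs information from both subgraphs, here is a small TypeScript sketch that combines the two. It assumes a hypothetical, pre-merged view per union member: the field types declared in the subgraph that resolves `u`, plus the list of fields the planner would add because of @key or @requires elsewhere. The `MemberInfo` shape and `findAddedFieldConflicts` are invented for this example and do not exist in federation.

```typescript
// Hypothetical merged view of one union member across subgraphs.
interface MemberInfo {
  typeName: string;
  // Field types as declared in the subgraph that resolves the union field.
  localFieldTypes: { [field: string]: string };
  // Fields the planner would add to the fetch (keys, or fields named in
  // @requires by another subgraph), even if the user never queried them.
  plannerAddedFields: string[];
}

// Reports added fields that would appear under the same response name with
// different types in sibling fragments.
function findAddedFieldConflicts(members: MemberInfo[]): string[] {
  const seen: { [field: string]: { typeName: string; fieldType: string } } =
    Object.create(null);
  const conflicts: string[] = [];
  for (const member of members) {
    for (const field of member.plannerAddedFields) {
      const fieldType = member.localFieldTypes[field];
      if (fieldType === undefined) continue;
      const previous = seen[field];
      if (previous === undefined) {
        seen[field] = { typeName: member.typeName, fieldType: fieldType };
      } else if (previous.fieldType !== fieldType) {
        conflicts.push(
          field + ": " + previous.typeName + " returns " + previous.fieldType +
            " but " + member.typeName + " returns " + fieldType
        );
      }
    }
  }
  return conflicts;
}

// The union example: id is the key, g is required by f in the other subgraph.
const unionConflicts = findAddedFieldConflicts([
  { typeName: "A", localFieldTypes: { id: "Int", g: "Int" }, plannerAddedFields: ["id", "g"] },
  { typeName: "B", localFieldTypes: { id: "Int", g: "String" }, plannerAddedFields: ["id", "g"] },
]);
// unionConflicts reports g, and nothing for id.
```

Neither input list can be built from one subgraph alone: `localFieldTypes` comes from the first subgraph and `plannerAddedFields` from the @requires in the second, which is the difficulty described above.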

So I still suggest we finish reviewing and commit #1318, because it's ready and it's better than nothing, but we definitely need to find cycles for fixing this properly.

pcmanus avatar Jan 25 '22 17:01 pcmanus

I wanted to sum up where we are on this issue since there's been a bit of back-and-forth on my part, and even a related PR merged, and this might have muddied the waters.

The problem

First, the problem: GraphQL specifies that if two fields within a selection set have the same response name, then they must have the exact same type. So if you have some field t of type T and:

{
  t {
    ... on A {
      x
    }
    ... on B {
      x
    }
  }
}

then both A.x and B.x must have the exact same type. And there are a number of ways such a query may be invalid in this way:

  • T may be a union comprising A and B, in which case the two x fields can be completely unrelated (it's the example of my previous comment).
  • T may be an interface of A and B, but with x not being in T, so that the two x are unrelated fields that happen to have the same name.
  • T may be an interface of A and B and contain x, but because implementations are allowed to use subtypes of the interface field's type when implementing a field, the types of the two x may still differ (it's the case of the initial description).

Now, if a user does this, the query is invalid and they need to use an alias for at least one of the x fields in the query above.

However, there are cases where the query planner can generate queries that fall into the invalid "pattern" we just described, even if the original user query does not have that pattern. Afaict, there are 2 main reasons for this:

  1. type-explosion of interfaces. This is again the case in the original issue description. As said previously, this will affect fewer use cases than in fed 1, because while fed 1 always type-exploded interfaces, fed 2 only does it in a more restricted subset of cases. Nonetheless, this may still affect fed 2.
  2. non-queried fields added by the query planner: those are either @key fields or @requires fields that, even when not in the original query, may be added to subgraph queries by the query planner. When the QP does so, it currently does not validate that this doesn't create the invalid pattern described above, and so, well, this can happen in some specific cases.

Potential solutions

As mentioned previously, the only proper solution to this issue is to ensure that the query planner does what any user would do when faced with this pattern: it should use aliases in the subgraph fetches to ensure that fields with conflicting types end up having different response names.

However, while the principle is simple, I think the concrete implementation is a bit of effort.

First, this isn't a query-planner-only change. Even assuming the QP knows to add aliases in the proper cases to make the subgraph query valid, it means that some data in the result of that fetch will essentially have the wrong name (the alias, instead of the actual field name). So the execution part of the query planner would need to "transform" the returned subgraph data, renaming the aliased fields back to their original names, and do so before merging the subgraph data into the in-memory result data. This probably means that the query plan for such fetches should list a number of post-query rewrites that need to be performed by execution (which means an addition to the query plan format in particular). And all of that needs to account for the fact that the original query may have aliases in the first place (at least for the type-explosion case; for the case of non-queried @key/@requires fields added by the QP, that's not an issue).
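As a rough idea of what such a post-query rewrite step could look like on the execution side, here is a TypeScript sketch. The `RenameRewrite` shape is purely hypothetical, since the query plan format has no such entry in anything described here, and the sketch only handles a single flat object rather than walking a path into nested results.

```typescript
// Hypothetical rewrite entry a query plan might carry for an aliased fetch.
interface RenameRewrite {
  onTypename: string; // apply only to objects with this __typename
  from: string;       // alias used in the subgraph fetch
  to: string;         // response name expected by the rest of execution
}

type JsonObject = { [key: string]: unknown };

// Returns a copy of the subgraph data with aliased fields renamed back,
// meant to run before the data is merged into the in-memory result.
function applyRewrites(data: JsonObject, rewrites: RenameRewrite[]): JsonObject {
  const result: JsonObject = {};
  for (const key of Object.keys(data)) {
    result[key] = data[key];
  }
  for (const rewrite of rewrites) {
    if (result["__typename"] !== rewrite.onTypename) continue;
    if (!Object.prototype.hasOwnProperty.call(result, rewrite.from)) continue;
    result[rewrite.to] = result[rewrite.from];
    delete result[rewrite.from];
  }
  return result;
}

// Assumed aliases of the form "<field>__<TypeName>", for illustration only.
const rewrites: RenameRewrite[] = [
  { onTypename: "Tweet", from: "likes__Tweet", to: "likes" },
  { onTypename: "Retweet", from: "likes__Retweet", to: "likes" },
];
const restored = applyRewrites({ __typename: "Tweet", likes__Tweet: 3 }, rewrites);
// restored is { __typename: "Tweet", likes: 3 }
```

Keeping the rewrites as data in the plan, rather than inferring them from a naming convention at execution time, would sidestep the question of colliding with user aliases, at the cost of extending the plan format.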

On the side of the query planner implementation, there are also a few questions. In particular, is it easier to detect the cases where we must add an alias and only add it then, or is it easier to add aliases more routinely, even if it's not always useful? To be fair, we probably want to do the former, mostly because the latter would probably change tons of existing query plans and this could be a scare for users on upgrade, but I'm unsure the former is the simpler or more efficient option.

So that's my "summary" (I'm bad at this, right!?). I'm taking the time to lay this out because, as it is not a trivial chunk of work, I'm not yet sure when this will rise to the top of the TODO list for the good folks here at Apollo, so if someone feels like scratching that itch in the meantime, this might help that someone get started.

pcmanus avatar Jun 21 '22 15:06 pcmanus