router Remove legacy validation

Fix #4407 Fix #4409 This removes GraphQL validation from the query planner, to use the Rust version instead. Validation has now moved to the query analysis layer, which means we can remove a lot of code that was there to accommodate parsing and validating the query but not rejecting it outright, because we needed to compare the validation result with the planner's.

Remaining issues:

[x] usage reporting: since invalid queries are rejected in query analysis, inside the router service, then usage reporting for invalid queries is no longer reported, because the telemetry plugins handles usage reporting at the supergraph response level
[x] persisted queries: apparently persisted queries use the query analysis transformation of supergraph requests, now one test fails on an invalid query, I have not yet looked at why
[x] @defer directive: a test is failing because the directive is used in the query but not found by the schema. @lrlna what is apollo-compiler expecting here? Is @defer considered builtin, or does it need to be declared in the schema?

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

[ ] Changes are compatible[^1]
[ ] Documentation[^2] completed
[ ] Performance impact assessed and acceptable
Tests added and passing[^3]
- [ ] Unit Tests
- [ ] Integration Tests
- [ ] Manual Tests

Exceptions

Note any exceptions here

Notes

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. [^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples. [^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

Jan 26 '24 13:01 Geal

@Geal, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

Jan 26 '24 13:01 github-actions[bot]

CI performance tests

[ ] events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
[ ] events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
[ ] events - Stress test for events with a lot of users and deduplication ENABLED
[ ] large-request - Stress test with a 1 MB request payload
[x] step - Basic stress test that steps up the number of users over time
[ ] xlarge-request - Stress test with 10 MB request payload
[ ] reload - Reload test over a long period of time at a constant rate of users
[ ] no-graphos - Basic stress test, no GraphOS.
[ ] events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
[ ] xxlarge-request - Stress test with 100 MB request payload
[ ] step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
[ ] events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
[x] const - Basic stress test that runs with a constant number of users
[ ] events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode

Jan 26 '24 13:01 router-perf[bot]

@defer directive: a test is failing because the directive is used in the query but not found by the schema. @lrlna what is apollo-compiler expecting here? Is @defer considered builtin, or does it need to be declared in the schema?

@Geal we do not define it as a built-in in apollo-compiler. The supergraphs that support defer, will have a defer definition already defined. And it looked like a bunch of tests ^[1] ^[2] ^[3] in the router already had them defined as part of the schema. I assume this test also needs it defined.

Jan 29 '24 09:01 lrlna

alright, that's an easy fix then

Jan 29 '24 09:01 Geal

As an FYI, I know @goto-bus-stop had a change they need to run in our dual mode for a couple of days before we strip this out completely.

Feb 05 '24 11:02 lrlna

alright, I'm holding off on merging it for now

Feb 05 '24 11:02 Geal

@lrlna @goto-bus-stop is it ok to merge now?

Feb 07 '24 10:02 Geal

We still need to approve & release the apollo-compiler validation fix and get it out in a public router release (via https://github.com/apollographql/router/pull/4510, just a patch version), and then let it bake for a few days and make sure the metrics are 100%. Simon would be the best to review it but he's out today, I expect we'll have to wait a week before we can remove legacy validation entirely

Feb 07 '24 10:02 goto-bus-stop

updated to the latest dev now that #4510 is merged

Feb 15 '24 15:02 Geal

@goto-bus-stop @SimonSapin @lrlna is there anything still blocking this?

Feb 22 '24 13:02 Geal

@Geal let me take a look at the metrics real quick

Feb 22 '24 14:02 lrlna

@geal we haven't had any hiccups in the metrics since 1.40.1 released on monday. @goto-bus-stop the harness was matching with both implementations, yea?

Feb 22 '24 14:02 lrlna

I am not sure if we have metrics from the new version yet specifically from the operations that were affected before (it's at least not in the nightly analytics rollup data yet). If we do, then we are good to go

Feb 22 '24 15:02 goto-bus-stop

The option should also be removed from documentation: https://github.com/apollographql/router/blob/23656eac8beba4458f3e994538cf297403edc73c/docs/source/configuration/overview.mdx#L801-L811

Feb 22 '24 15:02 goto-bus-stop

@goto-bus-stop nice catch, thanks 746af65

Feb 22 '24 15:02 Geal

Marked as approved to mean I don’t need to approve again after this change, but I do think the parse error thing should be changed.

Apr 11 '24 07:04 SimonSapin

We are using a plugin defining a supergraph_service . Before this change, this method was called, we could do some request validation in it. After this change, it seems that this method is not called when the GraphQL query is invalid. Can you confirm that this PR could change this behavior? Is it intended?

Apr 26 '24 14:04 yanns

@yanns What was the request validation that you were doing at the supergraph service? Something GraphQL-specific?

Apr 30 '24 09:04 abernix

@yanns What was the request validation that you were doing at the supergraph service? Something GraphQL-specific?

There are 2 use-cases:

(1) as we have a public GraphQL API, we need to translate some error code. So we need a plugin that translates the error codes emitted by the router to "our" error codes. If the router rejects invalid GraphQL queries without calling plugins, we cannot modify the response.
(2) we are using feature toggles. One GraphQL query does not use exactly the same GraphQL schema as another GraphQL query. So we need to validate requests depending on the GraphQL schema that is valid for this particular request. For this, I don't think that this PR is an issue. We configure the router with a GraphQL schema that has all possible fields. After the router has accepted the query, we validate it again with another GraphQL that is a subset of the one used by the router.

So I guess our main issue is that our plugin cannot adapt the response in case of invalid GraphQL queries.

Apr 30 '24 09:04 yanns

Helllo Team, Really appreciate the work you guys are doing. I have a use-case where I don't want any basic validation on router level. For that I did some changes in apollo-compiler and router to ignore validation error on rust , and for the most part it is working. But for some queries, fragment types in particular are still giving validation error. After exploring I have found out that these validation checks are coming from js (legacy). I even tried using experimental_graphql_validation_mode: new in my yaml, but still it persists and validation errors are coming from Legacy.

To reproduce Query is:

query getCall1($cursor: String, $size: Int) {
  call1: getCall1(cursor: $cursor, size: $size) {
    results {
      id
      payload {
        ...Payload1
        __typename
      }
      sender
      __typename
    }
    __typename
  }
}

fragment Payload1 on Payload1 {
  field1
  type
  ...Type1
  ...Type2
  __typename
}

fragment Type1 on Type1 {
  field2
  __typename
}

fragment Type2 on Type2 {
    field3
  __typename
}

and say superschema doesn't have Type2 info. The error I am getting is

{
	"errors": [
		{
			"message": "Cannot find fragment name \"Type2\" (provided fragments are: [Type1, Type3])",
			"extensions": {
				"code": "GRAPHQL_VALIDATION_FAILED"
			}
		}
	]
}

Any help is appreciated, thank you.

Aug 01 '24 10:08 sushant3524

@sushant3524 Accepting invalid queries in the router may cause all kind of unsupported glitches in the router and in the query planner. The router does not do validation in JS but I think the query planner may validate its own outputs at the end of query planning. Subgraphs also usually validate queries and if they return an error the router will propagate it. Everything the router does relies on the query being valid so trying to pass through something invalid is asking for trouble.

Aug 01 '24 15:08 goto-bus-stop

Thanks @goto-bus-stop for the quick revert. Have reviewed much of your code in the router repository and it is a pleasure to finally engage with you. I understand and accept the risks and will very well work on it. I just want that router doesn't validate through JS and as this PR removes it, I am not sure how and why JS is validating.

Is it because of #4964 ? Or is it because router uses JS for some other usecase like maybe query_planner and there in JS if the query looks fishy they validate it again?

Aug 02 '24 03:08 sushant3524

after searching I have found out that this is coming in bridge_query_planner from PlannerMode. Are we working on removing dependency from JS for it?

Aug 05 '24 09:08 sushant3524

router router copied to clipboard

Remove legacy validation

router
router copied to clipboard