sangria icon indicating copy to clipboard operation
sangria copied to clipboard

Optimize OverlappingFieldsCanBeMerged

Open objmagic opened this issue 8 years ago • 15 comments

QueryReducer.rejectComplexQueries become stuck with a 412KBytes GraphQL query selecting ~150000 fields. Did a profiling and stacktrace shows we are doing tremendous amount of work in OverlappingFieldsCanBeMerged

An example of the query is at here

We are using:

  • Sangria version 1.3.2
  • JVM 8 (with some Twitter internal customization)

A screenshot of part of the stacktrace (sorry I cannot provide complete stacktrace).

image

objmagic avatar Nov 20 '17 18:11 objmagic

Thanks for reporting the issue! Also, the example query would be quite useful. I think the main issue is with the OverlappingFieldsCanBeMerged validation. It is one of the most complex validations, and in this particular query, it probably does a lot of comparisons between all these top-level selection sets.

I will investigate the root cause.

OlegIlyenko avatar Nov 20 '17 18:11 OlegIlyenko

@OlegIlyenko

Thanks a lot for your prompt reply. I would love to work through this issue with you.

Let me see if I can get a better stacktrace.

objmagic avatar Nov 20 '17 18:11 objmagic

This would be awesome! 👍 As a first step, I think it would be really helpful to verify that mentioned validation is indeed the bottleneck. Can you please exclude it in your tests:

QueryReducerExecutor.reduceQueryWithoutVariables(
  ...
  queryReducers = ...,
  queryValidator = 
    new RuleBasedQueryValidator(
      QueryValidator.allRules
        .filterNot(_.isInstanceOf[OverlappingFieldsCanBeMerged])))

// or

Executor.execute(
  ...
  queryReducers = ...,
  queryValidator = 
    new RuleBasedQueryValidator(
      QueryValidator.allRules
        .filterNot(_.isInstanceOf[OverlappingFieldsCanBeMerged])))

If we can verify that it cases the slowdown, then we can focus on OverlappingFieldsCanBeMerged.

OlegIlyenko avatar Nov 20 '17 19:11 OlegIlyenko

remove the OverlappingFieldsCanBeMerged validator solves the issue.

objmagic avatar Nov 20 '17 19:11 objmagic

Great, this means that we can isolate the issue and focus the effort on optimising the OverlappingFieldsCanBeMerged validation. Maybe you could also change the issue title to reflect this fact?

If you would like to investigate the issue as well, here are some useful resources:

OlegIlyenko avatar Nov 20 '17 19:11 OlegIlyenko

Now that I think about it, recent bugfix introduced a change that reduces usage of the cache. i described this issue here: https://github.com/graphql/graphql-js/pull/780#pullrequestreview-49855652

This might be a promising path to investigate. For example, we can revert the bugfix and see how it performs without it. This commit introduced the bugfix: https://github.com/sangria-graphql/sangria/commit/9866a7bfab8d840185b4d646ee8909e9b9882749#diff-9ce8420e3a37d01d401ac1bc89e17e6e

OlegIlyenko avatar Nov 20 '17 19:11 OlegIlyenko

The validation rules are normally completely self-contained. So you can just copy/paste this old version of the validation under a different name in your test and add it in the RuleBasedQueryValidator instead of the new version of OverlappingFieldsCanBeMerged.

OlegIlyenko avatar Nov 20 '17 20:11 OlegIlyenko

Thanks. I'll try to find some time to debug this.

objmagic avatar Nov 21 '17 00:11 objmagic

Notice this problem can introduce serious problem. When doing curl on our API server, it takes 8~10 mins to get back a bad result. That said, attacker construct maliciously large query to DDoS your production service.

objmagic avatar Nov 21 '17 00:11 objmagic

Indeed, this is a concern.

Assuming that parsing and OverlappingFieldsCanBeMerged as well as other validations are well optimized. If incoming request is allowed to have unlimited payload size, then eventually parsing and validation become too slow, it is just a matter of payload size.

So far I thought about following possible solutions that generally implemented outside of GraphQL execution:

  • Always limit the request payload size
  • Use persisted queries, if possible
  • Track execution time and cancel the execution if it takes more time than a threshold (this can be also limited in scope, e.g. only parsing + validation). This is actually something that library can make easier to do. Since parsing and validation are CPU-bound tasks, it should be relatively easy to add simple cancellation mechanism (it becomes harder when IO with external systems is involved).

I would be very interested to hear you opinion and ideas on possible solutions and countermeasures for this issue.

OlegIlyenko avatar Nov 21 '17 17:11 OlegIlyenko

For now, we added a simple HTTP filter to filter out request that is above certain threshold.

objmagic avatar Nov 21 '17 18:11 objmagic

image

@OlegIlyenko using old version of the validation file and got the result above. The situation is still bad but I got part of the result back and then got bunch of HTTP timeout failures. That said, it seems the bug fix does not matter much here.

objmagic avatar Dec 01 '17 00:12 objmagic

I see, thanks for investigating this possibility! I guess we need to look at the original algorithm and try to optimise it.

OlegIlyenko avatar Dec 01 '17 01:12 OlegIlyenko

Thanks. I wouldn't worry much now since we have put an HTTP filter at the very front. Let's keep this issue open, just for other people who might have the same concern.

objmagic avatar Dec 01 '17 01:12 objmagic

We also experienced this issue at XING and invested some time to fix it. Here is the PR: https://github.com/sangria-graphql/sangria/pull/458 and here is the description of the algorithm.

SimonAdameit avatar Sep 19 '19 07:09 SimonAdameit

The faster version is now the default one starting from https://github.com/sangria-graphql/sangria/releases/tag/v3.0.0

yanns avatar Sep 27 '22 19:09 yanns