sangria
sangria copied to clipboard
Optimize OverlappingFieldsCanBeMerged
QueryReducer.rejectComplexQueries become stuck with a 412KBytes GraphQL query selecting ~150000 fields. Did a profiling and stacktrace shows we are doing tremendous amount of work in OverlappingFieldsCanBeMerged
An example of the query is at here
We are using:
- Sangria version 1.3.2
- JVM 8 (with some Twitter internal customization)
A screenshot of part of the stacktrace (sorry I cannot provide complete stacktrace).

Thanks for reporting the issue! Also, the example query would be quite useful. I think the main issue is with the OverlappingFieldsCanBeMerged validation. It is one of the most complex validations, and in this particular query, it probably does a lot of comparisons between all these top-level selection sets.
I will investigate the root cause.
@OlegIlyenko
Thanks a lot for your prompt reply. I would love to work through this issue with you.
Let me see if I can get a better stacktrace.
This would be awesome! 👍 As a first step, I think it would be really helpful to verify that mentioned validation is indeed the bottleneck. Can you please exclude it in your tests:
QueryReducerExecutor.reduceQueryWithoutVariables(
...
queryReducers = ...,
queryValidator =
new RuleBasedQueryValidator(
QueryValidator.allRules
.filterNot(_.isInstanceOf[OverlappingFieldsCanBeMerged])))
// or
Executor.execute(
...
queryReducers = ...,
queryValidator =
new RuleBasedQueryValidator(
QueryValidator.allRules
.filterNot(_.isInstanceOf[OverlappingFieldsCanBeMerged])))
If we can verify that it cases the slowdown, then we can focus on OverlappingFieldsCanBeMerged.
remove the OverlappingFieldsCanBeMerged validator solves the issue.
Great, this means that we can isolate the issue and focus the effort on optimising the OverlappingFieldsCanBeMerged validation. Maybe you could also change the issue title to reflect this fact?
If you would like to investigate the issue as well, here are some useful resources:
- The validation implementation: OverlappingFieldsCanBeMerged
- The tests: OverlappingFieldsCanBeMergedSpec
- Reference impl (sangria implementation is very close to it, but has a recent bugfix which is not yet in the ref. impl. - this might case some perf issues as well): OverlappingFieldsCanBeMerged.js (be careful, reference impl. still contains this bug: https://github.com/graphql/graphql-js/pull/780)
Now that I think about it, recent bugfix introduced a change that reduces usage of the cache. i described this issue here: https://github.com/graphql/graphql-js/pull/780#pullrequestreview-49855652
This might be a promising path to investigate. For example, we can revert the bugfix and see how it performs without it. This commit introduced the bugfix: https://github.com/sangria-graphql/sangria/commit/9866a7bfab8d840185b4d646ee8909e9b9882749#diff-9ce8420e3a37d01d401ac1bc89e17e6e
The validation rules are normally completely self-contained. So you can just copy/paste this old version of the validation under a different name in your test and add it in the RuleBasedQueryValidator instead of the new version of OverlappingFieldsCanBeMerged.
Thanks. I'll try to find some time to debug this.
Notice this problem can introduce serious problem. When doing curl on our API server, it takes 8~10 mins to get back a bad result. That said, attacker construct maliciously large query to DDoS your production service.
Indeed, this is a concern.
Assuming that parsing and OverlappingFieldsCanBeMerged as well as other validations are well optimized. If incoming request is allowed to have unlimited payload size, then eventually parsing and validation become too slow, it is just a matter of payload size.
So far I thought about following possible solutions that generally implemented outside of GraphQL execution:
- Always limit the request payload size
- Use persisted queries, if possible
- Track execution time and cancel the execution if it takes more time than a threshold (this can be also limited in scope, e.g. only parsing + validation). This is actually something that library can make easier to do. Since parsing and validation are CPU-bound tasks, it should be relatively easy to add simple cancellation mechanism (it becomes harder when IO with external systems is involved).
I would be very interested to hear you opinion and ideas on possible solutions and countermeasures for this issue.
For now, we added a simple HTTP filter to filter out request that is above certain threshold.

@OlegIlyenko using old version of the validation file and got the result above. The situation is still bad but I got part of the result back and then got bunch of HTTP timeout failures. That said, it seems the bug fix does not matter much here.
I see, thanks for investigating this possibility! I guess we need to look at the original algorithm and try to optimise it.
Thanks. I wouldn't worry much now since we have put an HTTP filter at the very front. Let's keep this issue open, just for other people who might have the same concern.
We also experienced this issue at XING and invested some time to fix it. Here is the PR: https://github.com/sangria-graphql/sangria/pull/458 and here is the description of the algorithm.
The faster version is now the default one starting from https://github.com/sangria-graphql/sangria/releases/tag/v3.0.0