
Control memory explosion on large lists of queries

Open rahulpowar opened this issue 7 years ago • 8 comments

In our use case (server to server), we were making batch queries to our Go server and ran into a problem once a single request started to exceed about 100 queries. We needed to pack many queries together because we can parallelise them very effectively and want to hide the request latencies.

Performance would collapse and resident memory would spike to tens of GB in production, where requests had ~3000 queries. I added a benchmark, BenchmarkFragmentQueries, to graphql_test.go, and with a little profiling it looked like almost all of this was due to validation. With an early return on validation, the time and memory overheads vanished.
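
For context, a minimal sketch of the shape such a benchmark could take (illustrative only, not the exact test that was added: it assumes a package-level `schema *graphql.Schema` built from the test suite's Star Wars example schema, and `hero { name }` is a stand-in field):

```go
// Illustrative sketch only — the real BenchmarkFragmentQueries may differ.
// It batches n copies of the same query under distinct aliases into one
// request, so the validator sees a single large selection set.
// Assumes imports: "context", "fmt", "strings", "testing".
func BenchmarkFragmentQueries(b *testing.B) {
	for _, n := range []int{1, 10, 100, 1000, 10000} {
		var sb strings.Builder
		sb.WriteString("{")
		for i := 0; i < n; i++ {
			fmt.Fprintf(&sb, " q%d: hero { name }", i) // non-overlapping aliases
		}
		sb.WriteString(" }")
		query := sb.String()

		b.Run(fmt.Sprintf("%d_queries_non-overlapping_aliases", n), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				// Exec parses, validates, and executes; validation
				// dominates the cost at large n.
				schema.Exec(context.Background(), query, "", nil)
			}
		})
	}
}
```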

https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L174-L178 causes a quadratic explosion of memory via https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L389-L393: the keys start taking up a huge amount of space, and the hashmap implementation struggles as it expands its buckets internally to cope. This memory is also not released when the map itself is GCed. To control this, the PR stops using the map when the selection length is greater than 100, which costs more computation but is still faster than fighting the map. I also made a few other tweaks to improve performance, and it now works fine for our use case, with MBs instead of GBs of RSS and sub-second overheads.
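
In outline, the mitigation is a size-gated cache: keep the pair-memoisation map for small selection sets, skip it for large ones. A minimal standalone sketch of the idea (the names pair, checkAll, and the threshold plumbing are illustrative, not the actual code in internal/validation):

```go
package main

import "fmt"

// pair mirrors the validator's selectionPair map key: two comparable
// selection identifiers combined into one key.
type pair struct{ a, b int }

// checkAll runs validate over every pair of n selections. Below the
// threshold it memoises checked pairs in a map; above it, it skips the
// cache and recomputes, trading CPU for the up-to-n*(n-1)/2 map keys
// that blew up memory on large batched requests.
func checkAll(n, threshold int, validate func(a, b int)) {
	var seen map[pair]struct{}
	if n <= threshold {
		seen = make(map[pair]struct{})
	}
	for a := 0; a < n; a++ {
		for b := a + 1; b < n; b++ {
			if seen != nil {
				if _, ok := seen[pair{a, b}]; ok {
					continue // already validated this pair
				}
				seen[pair{a, b}] = struct{}{}
			}
			validate(a, b)
		}
	}
}

func main() {
	// With ~3000 selections the cache would hold ~4.5M keys.
	n := 3000
	fmt.Printf("pair keys for %d selections: %d\n", n, n*(n-1)/2)
}
```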

The numbers get much worse if field names overlap, because validateFieldOverlap does not terminate early, but I skipped fixing/investigating that since it does not look like a common issue in the real world. However, if you try it, current master will exceed the 10-minute limit for a benchmark test, so public APIs likely need some guards against DoS. I believe this can all be a bit smarter, since the queries repeat the same fragments, but I am new to this codebase.

Performance tests

The headline for us was reducing 65s (of mainly validation) down to 1s for 10,000 queries in the request.

# master with the new test added
$ go test -bench=FragmentQueries -benchmem > /tmp/master-graphql.txt

# this fork
$ go test -bench=FragmentQueries -benchmem > /tmp/branch-graphql.txt

benchcmp /tmp/master-graphql.txt /tmp/branch-graphql.txt
benchmark                                                            old ns/op       new ns/op      delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         42890           41963          -2.16%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        239469          201074         -16.03%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       3337688         2820303        -15.50%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      376659442       24807675       -93.41%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     65201269084     1067392258     -98.36%

benchmark                                                            old allocs     new allocs     delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         254            252            -0.79%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        1981           1973           -0.40%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       19559          19245          -1.61%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      247891         189323         -23.63%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     7900911        1916028        -75.75%

benchmark                                                            old bytes       new bytes     delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8         16300           14488         -11.12%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8        146670          126303        -13.89%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8       2322911         1674979       -27.89%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8      169845173       10843020      -93.62%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8     10961889744     115822224     -98.94%

rahulpowar avatar Jul 07 '17 14:07 rahulpowar

Have zapped the gitignore as requested.

rahulpowar avatar Jul 22 '17 12:07 rahulpowar

Can someone merge this?

aaahrens avatar Sep 10 '18 20:09 aaahrens

There are merge conflicts to be fixed first...

pavelnikolov avatar Oct 04 '18 02:10 pavelnikolov

The situation occurs quickly with large request sizes. I sent a ~100KB query string and it rapidly leaked more than 1GB of memory in the overlapValidated map.

c.overlapValidated[selectionPair{a, b}] = struct{}{}
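
As a back-of-the-envelope check: that line stores one key per selection pair, so n selections can produce up to n(n-1)/2 keys. A ~100KB query with, say, 2,000 selections (an assumed figure for illustration) would mean roughly 2,000 × 1,999 / 2 ≈ 2 million selectionPair keys, and Go's map implementation never shrinks its bucket array once it has grown, which matches the memory staying resident after the spike.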

andreiavrammsd avatar Mar 14 '19 14:03 andreiavrammsd

I'd be willing to put together a PR with the merge conflicts fixed here, as well as the benchmark fix. If we did this, how would we want to approach the threshold? Make it configurable, or just keep 100 as the default?

stefanvanburen avatar Mar 14 '19 14:03 stefanvanburen

Thank you, @svanburen! I would prefer if this is opt-in and the threshold is configurable.

pavelnikolov avatar Mar 14 '19 20:03 pavelnikolov
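
An opt-in, configurable threshold would presumably follow the SchemaOpt functional-option pattern the library already uses (e.g. graphql.MaxParallelism). A hypothetical sketch — none of these names exist in the library today:

```go
// Hypothetical only: an opt-in SchemaOpt gating the overlap-validation
// pair cache. overlapCacheLimit is an assumed unexported field on
// Schema, not a real one; this would live inside the graphql package.
func OverlapCacheLimit(n int) SchemaOpt {
	return func(s *Schema) {
		s.overlapCacheLimit = n
	}
}
```

Callers could then opt in at schema construction, e.g. graphql.MustParseSchema(schemaString, resolver, graphql.OverlapCacheLimit(100)), leaving existing behaviour unchanged when the option is omitted.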

I've been playing around with this for a bit and have it mostly working. The only issue is that, with the current implementation, if we make the threshold configurable and it's set below a certain value, the validation tests panic (a value of 1 or 2 seems to trigger it).

I can put up a PR with the failing tests, but any initial thoughts here? I'm wondering if we should ignore the configuration for the time being and just get a mergeable PR in?

stefanvanburen avatar Mar 15 '19 01:03 stefanvanburen

There are merge conflicts to be fixed first...

Fixed

dmotylev avatar Mar 25 '20 09:03 dmotylev