graphql-go
Control memory explosion on large list of queries
In our use case (server to server), we make batch queries to our Go server and ran into a problem once a single request started to exceed about 100 queries. We pack many queries into a single request because we can parallelise them very effectively and want to hide the request latencies.
Performance would collapse and resident memory would spike to tens of GB in production, where requests had ~3000 queries. I added a benchmark, BenchmarkFragmentQueries, to graphql_test.go, and with a little profiling it looked like almost all of this cost was in validation: with an early return from validation, the time and memory overheads vanished.
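For context, the requests in question are shaped roughly like the sketch below: many aliased copies of the same selection spreading a shared fragment, packed into one query document. This is a hypothetical illustration (buildBatchedQuery and the star-wars-style fields are made up for the example), not the literal query used in BenchmarkFragmentQueries.

package main

import (
	"fmt"
	"strings"
)

// buildBatchedQuery mimics the shape of the benchmarked requests: n aliased
// copies of the same selection, each spreading a shared fragment.
func buildBatchedQuery(n int) string {
	var b strings.Builder
	b.WriteString("query Batch {\n")
	for i := 0; i < n; i++ {
		// Non-overlapping aliases (q0, q1, ...) keep the response fields distinct.
		fmt.Fprintf(&b, "  q%d: hero(episode: EMPIRE) { ...heroFields }\n", i)
	}
	b.WriteString("}\n")
	b.WriteString("fragment heroFields on Character { id name friends { name } }\n")
	return b.String()
}

func main() {
	// A request carrying ~3000 such queries is where validation cost exploded for us.
	fmt.Println(buildBatchedQuery(3))
}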
https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L174-L178 causes a quadratic explosion of memory via https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L389-L393: the keys start taking up a huge amount of space, and the hashmap implementation struggles as it expands buckets internally to cope. This memory is also not released when the map itself is garbage collected. To control this, this PR stops using the map when the selection length exceeds 100, which costs more computation but is still faster than fighting the map. I also made a few other tweaks to improve performance, and it now works fine for our use case, with MBs instead of GBs of RSS and sub-second overheads.
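The core of the change, as described above, is to bypass the pair-memoization map once a selection set gets large. Below is a minimal, simplified sketch of that idea, with hypothetical types standing in for the validator's internals (the real cache is keyed on selection pairs; see the overlapValidated line quoted later in this thread).

package validation

// Hypothetical, simplified sketch of the threshold guard described above;
// not the actual graphql-go implementation.
type selection struct{ name string }

type selectionPair struct{ a, b *selection }

type context struct {
	overlapValidated map[selectionPair]struct{}
}

const overlapCacheThreshold = 100 // above this, skip the map entirely

func (c *context) validateOverlap(selections []*selection) {
	useCache := len(selections) <= overlapCacheThreshold
	for i, a := range selections {
		for _, b := range selections[i+1:] {
			if useCache {
				key := selectionPair{a, b}
				if _, done := c.overlapValidated[key]; done {
					continue // this pair was already checked on an earlier visit
				}
				c.overlapValidated[key] = struct{}{}
			}
			checkPair(a, b) // the actual field-overlap/conflict check
		}
	}
}

func checkPair(a, b *selection) { /* field-overlap rules elided */ }

In the real validator the cache presumably pays off when the same pairs are revisited (e.g. through fragment spreads); above the threshold, redoing the pairwise checks is cheaper than feeding the map, which is the trade-off this PR makes.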
The numbers get much worse if field names overlap, because validateFieldOverlap does not terminate early, but I skipped investigating/fixing that since it does not look like a common issue in the real world. However, if you try it, current master will exceed the 10-minute limit for a benchmark test, so public APIs likely need some guards to prevent DoS issues. I believe this could all be made a bit smarter, since the queries repeat the same fragments, but I am new to this codebase.
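Until the validator itself is hardened, a public endpoint can at least apply a crude guard before any parsing or validation happens, for example by capping the request body size. A hypothetical sketch follows; the 64 KiB limit and the limitQuerySize name are illustrative, not anything graphql-go provides.

package main

import (
	"log"
	"net/http"
)

const maxQueryBytes = 64 << 10 // 64 KiB; illustrative, size it for your real batches

// limitQuerySize caps the request body size so an oversized query document is
// rejected before it ever reaches parsing or validation.
func limitQuerySize(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, maxQueryBytes)
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Wrap whatever handler serves /query (e.g. relay.Handler in graphql-go).
	http.Handle("/query", limitQuerySize(http.NotFoundHandler()))
	log.Fatal(http.ListenAndServe(":8080", nil))
}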
Performance tests
The headline for us was reducing 65s (mostly validation) down to 1s for 10,000 queries in a single request.
# master with the new test added
$ go test -bench=FragmentQueries -benchmem > /tmp/master-graphql.txt
# this fork
$ go test -bench=FragmentQueries -benchmem > /tmp/branch-graphql.txt
benchcmp /tmp/master-graphql.txt /tmp/branch-graphql.txt
benchmark old ns/op new ns/op delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8 42890 41963 -2.16%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8 239469 201074 -16.03%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8 3337688 2820303 -15.50%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8 376659442 24807675 -93.41%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8 65201269084 1067392258 -98.36%
benchmark old allocs new allocs delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8 254 252 -0.79%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8 1981 1973 -0.40%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8 19559 19245 -1.61%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8 247891 189323 -23.63%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8 7900911 1916028 -75.75%
benchmark old bytes new bytes delta
BenchmarkFragmentQueries/1_queries_non-overlapping_aliases-8 16300 14488 -11.12%
BenchmarkFragmentQueries/10_queries_non-overlapping_aliases-8 146670 126303 -13.89%
BenchmarkFragmentQueries/100_queries_non-overlapping_aliases-8 2322911 1674979 -27.89%
BenchmarkFragmentQueries/1000_queries_non-overlapping_aliases-8 169845173 10843020 -93.62%
BenchmarkFragmentQueries/10000_queries_non-overlapping_aliases-8 10961889744 115822224 -98.94%
Have zapped the gitignore as requested.
Can someone merge this?
There are merge conflicts to be fixed first...
The situation occurs quickly with large request sizes. I sent a ~100KB query string and it rapidly leaked more than 1GB of memory at the overlapValidated map:
c.overlapValidated[selectionPair{a, b}] = struct{}{}
I'd be willing to put together a PR with the merge conflicts fixed here, as well as the benchmark fix. If we did this, how would we want to approach the threshold: make it configurable, or just keep 100 as the default?
Thank you, @svanburen! I would prefer if this is opt-in and the threshold is configurable.
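graphql-go configures schemas through functional options, so an opt-in, configurable threshold could follow the same pattern. The sketch below is standalone and uses hypothetical names (Option, OverlapCacheThreshold); it is not the library's actual API.

package example

type validationConfig struct {
	// 0 means the optimisation is off and the map is always used,
	// preserving current behaviour for existing users.
	overlapCacheThreshold int
}

type Option func(*validationConfig)

// OverlapCacheThreshold opts in to skipping the pair-memoization map once a
// selection set grows beyond n items.
func OverlapCacheThreshold(n int) Option {
	return func(c *validationConfig) { c.overlapCacheThreshold = n }
}

func newConfig(opts ...Option) *validationConfig {
	c := &validationConfig{} // zero value: feature disabled, so it is opt-in
	for _, opt := range opts {
		opt(c)
	}
	return c
}

With the zero value meaning "disabled", schemas that never pass the option keep today's behaviour, which matches the opt-in requirement.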
I've been playing around with this for a bit and have it mostly working. The only issue is that, with the current implementation, if we make the threshold configurable and it is set below a certain value, the validation tests panic (a value of 1 or 2 seems to trigger this).
I can put up a PR with the failing tests, but any initial thoughts here? I'm wondering if we should ignore the configuration for the time being and just get a mergeable PR in?
There are merge conflicts to be fixed first...
Fixed