Proposal: deterministic batch loading
Context
Dataloaders
As described in the documentation, calls to databases or other external services need to be batched to achieve decent performance. This is generally done using the dataloader pattern.
In Go, the dataloader implementations (dataloaden, graph-gophers/dataloader) work roughly this way:
- dataloader.Load(id) is called
- dataloader.Load blocks for a set amount of time (for example 5ms)
- once that time has elapsed, the user-provided Fetch function is called with the ids passed to all the dataloader.Load calls made during the blocking window; Fetch returns the values, and the blocked dataloader.Load calls then return them
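For reference, a minimal sketch of that pattern in Go (illustrative only; the UserLoader type, its fields and the matching of results by id are placeholders, not the actual dataloaden or graph-gophers/dataloader API):

import (
	"sync"
	"time"
)

type User struct{ ID string }

// Illustrative dataloader: Load blocks until the wait window closes, then a
// single Fetch call resolves every id queued during that window.
type UserLoader struct {
	Wait  time.Duration                       // batch window, e.g. 5ms
	Fetch func(ids []string) ([]*User, error) // user-provided batch function

	mu      sync.Mutex
	pending []string
	done    chan struct{}
	results map[string]*User
}

func (l *UserLoader) Load(id string) (*User, error) {
	l.mu.Lock()
	if l.pending == nil {
		// First Load of a new batch: schedule the flush after the wait window.
		l.done = make(chan struct{})
		l.results = make(map[string]*User)
		time.AfterFunc(l.Wait, l.flush)
	}
	l.pending = append(l.pending, id)
	done, results := l.done, l.results
	l.mu.Unlock()

	<-done // block until the window has elapsed and Fetch has run
	return results[id], nil
}

func (l *UserLoader) flush() {
	l.mu.Lock()
	ids, done, results := l.pending, l.done, l.results
	l.pending = nil
	l.mu.Unlock()

	users, _ := l.Fetch(ids) // error handling elided for brevity
	for _, u := range users {
		results[u.ID] = u
	}
	close(done)
}

Every Load call in a batch pays the full wait window, and every call goes through a mutex, which is where the drawbacks below come from.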
This approach has several drawbacks:
- each call to dataloader.Load that starts a new batch blocks N ms, which means nested queries take at least depth * N ms
- dataloader.Load calls need to be thread-safe, which results in significant overhead
- most importantly, under load (including for large GraphQL queries), unless the wait time is excessively large, the dataloader.Load calls will be made too slowly to fall inside the same batch window, resulting in poor batching, reduced performance and increased database load
gqlgen
gqlgen builds the query results depth-first, top to bottom, and spawns a goroutine for each field in a field set and for each element in an array (with a few optimizations). For example, the query query { users { stars likes } } roughly results in: Users(); go marshallUser(user1); go Stars(user1); go Likes(user1); go marshallUser(user2); ...
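Roughly, that fan-out behaves like the following simplified sketch (reusing the illustrative User type from the sketch above; the resolver stubs are hypothetical and this is not gqlgen's actual generated code):

import (
	"context"
	"sync"
)

// Stand-ins for the per-object field resolvers.
func resolveStars(ctx context.Context, u *User) int { return 0 }
func resolveLikes(ctx context.Context, u *User) int { return 0 }

// Simplified fan-out for query { users { stars likes } }: one goroutine per
// user, plus one goroutine per field of each user.
func marshalUsers(ctx context.Context, users []*User) {
	var wg sync.WaitGroup
	for _, u := range users {
		u := u
		wg.Add(1)
		go func() {
			defer wg.Done()
			var fields sync.WaitGroup
			fields.Add(2)
			go func() { defer fields.Done(); resolveStars(ctx, u) }()
			go func() { defer fields.Done(); resolveLikes(ctx, u) }()
			fields.Wait()
		}()
	}
	wg.Wait()
}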
This compounds the dataloaders drawbacks since:
- the goroutines aren't spawned in the right order to optimize batching (ideally all the calls to the Stars resolver would be made first, then all the calls to the Likes resolver)
- the approach can result in spawning a large number of goroutines (for example 3000 goroutines for 1000 users), further adding overhead
Previous discussion
https://github.com/99designs/gqlgen/issues/518
Proposal
Add built-in batch-loading to gqlgen. Instead of having the following resolver signature:
func (r *itemResolver) Children(ctx context.Context, obj *model.Item, ...) ([]*model.Item, error)
have this one:
func (r *itemResolver) Children(ctx context.Context, obj []*model.Item, ...) ([][]*model.Item, []error)
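For illustration, a resolver with the proposed signature can then issue a single backend round trip for all parents. A rough sketch, with the extra field arguments omitted and fetchChildrenByParentIDs a hypothetical data-access helper returning children keyed by parent id:

func (r *itemResolver) Children(ctx context.Context, objs []*model.Item) ([][]*model.Item, []error) {
	ids := make([]string, len(objs))
	for i, obj := range objs {
		ids[i] = obj.ID
	}

	// Hypothetical helper: one database query for every parent at once.
	byParent, err := fetchChildrenByParentIDs(ctx, ids)
	if err != nil {
		errs := make([]error, len(objs))
		for i := range errs {
			errs[i] = err
		}
		return nil, errs
	}

	// One result slice per parent, in the same order as the input.
	out := make([][]*model.Item, len(objs))
	for i, obj := range objs {
		out[i] = byParent[obj.ID]
	}
	return out, nil
}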
Proof of concept (simple schemas work; nulls, error handling, introspection, middlewares and likely many other things don't).
From an implementation standpoint, this would be done by walking the query tree depth-first, calling each resolver once per node of the query. Taking the example above, this would result in roughly the following execution plan: Users(); go Stars(users); go Likes(users).
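A rough sketch of such a walk, with hypothetical types (the real gqlgen execution machinery would look quite different):

import (
	"context"
	"sync"
)

// queryNode is one field of the parsed query; resolve handles that field for
// every parent object in a single call, as in the proposed signature.
type queryNode struct {
	resolve  func(ctx context.Context, parents []any) ([][]any, []error)
	children []*queryNode
}

// walk calls each resolver exactly once per query node, passing all parents
// together, then recurses into sibling fields concurrently.
func walk(ctx context.Context, node *queryNode, parents []any) {
	groups, _ := node.resolve(ctx, parents) // error handling elided
	var flat []any
	for _, g := range groups {
		flat = append(flat, g...)
	}
	var wg sync.WaitGroup
	for _, child := range node.children {
		child := child
		wg.Add(1)
		go func() { defer wg.Done(); walk(ctx, child, flat) }() // e.g. go Stars(users); go Likes(users)
	}
	wg.Wait()
}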
Advantages:
- Queries are batched optimally and deterministically
- Few goroutines are spawned and there is no need for synchronization in the resolvers
- The right way to do things (batching) is the default way to do it
- This can easily be made backwards-compatible by generating code that calls the "old-style" resolvers in a loop, as sketched below
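For example, the generated compatibility glue could look roughly like this (sketch only; childrenBatch is a hypothetical name for the generated wrapper, the extra field arguments are omitted, and Children is the user's existing old-style resolver):

// Generated glue: satisfy the batched signature by looping over the existing
// one-object-at-a-time resolver.
func (r *itemResolver) childrenBatch(ctx context.Context, objs []*model.Item) ([][]*model.Item, []error) {
	results := make([][]*model.Item, len(objs))
	errs := make([]error, len(objs))
	for i, obj := range objs {
		results[i], errs[i] = r.Children(ctx, obj) // old-style resolver, one call per object
	}
	return results, errs
}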
Drawbacks:
- If the same data is requested in different parts of the query, it can be fetched only once with dataloaders, but not with this approach
- Implementing this in gqlgen would be pretty involved
Benchmark
Test query:
query q {
  root {
    children(num: 100, latency: 0) {
      children(num: 100, latency: 0) {
        children(num: 100, latency: 0) {
          id
        }
      }
    }
  }
}
The children resolver returns an array of num items, resulting in 100x100x100 = 1,000,000 leaf items.
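For reference, the benchmark's children resolver is essentially a synthetic fan-out along these lines (a reconstructed sketch that assumes latency is in milliseconds; the actual benchmark code may differ):

// Synthetic resolver: returns num fake children named after their parent and
// optionally sleeps to simulate backend latency.
func (r *itemResolver) Children(ctx context.Context, obj *model.Item, num int, latency int) ([]*model.Item, error) {
	time.Sleep(time.Duration(latency) * time.Millisecond)
	children := make([]*model.Item, num)
	for i := range children {
		children[i] = &model.Item{ID: fmt.Sprintf("%s_%d", obj.ID, i)}
	}
	return children, nil
}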
Tested implementations:
- gqlgen + dataloaden
- proof of concept
Execution plan
Each line in the logs below is a call to the dataloader fetch function (gqlgen + dataloaden) or to the resolver (proof of concept); p0 is the id of the first item in the argument array, and the number at the end is the length of the argument array.
gqlgen+dataloaden:
Items (p0: root): 1
Items (p0: root_99): 100
Items (p0: root_99_99): 352
Items (p0: root_91_99): 489
Items (p0: root_81_99): 762
Items (p0: root_67_99): 1000
Items (p0: root_49_99): 1000
Items (p0: root_31_99): 1000
Items (p0: root_13_99): 1000
Items (p0: root_97_29): 289
Items (p0: root_91_39): 511
Items (p0: root_79_65): 638
Items (p0: root_67_50): 509
Items (p0: root_53_26): 142
Items (p0: root_48_82): 774
Items (p0: root_31_34): 800
Items (p0: root_13_89): 533
Items (p0: root_92_73): 26
Items (p0: root_54_60): 63
Items (p0: root_50_69): 86
Items (p0: root_32_80): 26
Batching works well at the beginning (1000 is the max batch size), but degrades as processing progresses.
proof of concept:
Items (p0: root): 1
Items (p0: root_0): 100
Items (p0: root_0_0): 10000
The optimal number of calls is made.
Rough performance numbers
With GOMAXPROCS=1, the proof of concept does 570ms/req and gqlgen+dataloaden does 2.2s/req. With GOMAXPROCS=4, the proof of concept does 420ms/req and gqlgen+dataloaden does 950ms/req.
Note: a real-life implementation of the proof of concept could:
- be a little slower, since several things such as null handling and error handling aren't implemented in the proof of concept
- or be a little faster, because the proof of concept does some unnecessary slice resizing and allocations
Discussion
Is this something that would fit with gqlgen?
Any alternative approaches or suggestions?
Appreciate the feedback.
Would love to see this. It would be majorly breaking, but probably worth it!
Agreed, maybe gqlgen could allow you to opt into it per resolver?
I'm wondering if there is a coroutine solution here with the new iter.Seq and iter.Pull? Haven't thought too deeply on it yet, but I have been interested in more "coroutine"-like batching for gqlgen.