graphql-go
WIP: add parallel breadth-first execution
This change is still WIP and is not yet ready to merge. But it already passes all tests and I would like to coordinate the merge. Are you interested in a change like this, or should I consider starting a fork?
Problem Description
Currently, graph-gophers/graphql-go resolves nodes in depth-first order. This is bad for data loaders, because a common query like posts { author comments { author } } is resolved like:

posts
-> author of post 1
-> comments of post 1
-> authors of comment ...
-> author of post 2
-> comments of post 2
etc.
This gives data loaders no clear dispatch point, so they have to fall back on time-based heuristics (e.g. wait 100ms before dispatching), which increases query latency. Depth-first traversal is also the worst-case scenario for such time-based heuristics.
Another problem with time-based heuristics is that they degrade horribly under load. When the system load is low, the resolvers will be batched into a single database query. But once the load increases, the time limit might be too tight, and many database queries will be generated, adding even more load to an already overloaded system.
The official reference implementation does not have this problem. The JavaScript VM ensures that all resolvers are executed during the same tick within the event loop (unless they do a system call or anything else which triggers a new event), and dataloaders are registered to run at the next tick. Unfortunately, no such solution is possible in Go.
Proposed Solution
To avoid this problem, GraphQL libraries for Java and .NET resolve nodes in breadth-first order by default. This loads all posts at level 1, then all authors (of posts) and all comments at level 2, and then all authors (of comments) at level 3. Data loaders can be configured to dispatch their queries after each level.
Status of this PR
This change implements breadth-first traversal and already passes all test cases.
The subscription code still uses the old exec code. This should be refactored too before merging this.
The previous exec code did a wg.Wait() after resolving each node's children. Therefore, if there was a resolver that took a considerable amount of time (e.g. 1s), the response took 1s * number_of_nodes. The new code does only one wg.Wait() per level. This seems to be ideal, because proceeding earlier might have negative effects on data loaders anyway.
I have split the code for resolving and generating the JSON output into separate functions, making the whole code much easier to read.
The previous code used one bytes.Buffer per node. The new code uses only one for the whole request, but requires an in-memory tree of all resolved reflect.Values during the request.
The intention of the internal async flag was not clear to me. It looks like goroutines are only used when there is a resolver with a context field. But once there is at least one resolver that takes a context as an argument, all resolvers will be executed within goroutines. This behavior seems strange. I have not ported it yet.
This change does not change or add any APIs. We might want to consider exposing configurable execution strategies in the future (the Java GraphQL implementation offers this feature). More importantly, data loaders need a subscribe-able event to know when a level has been fully scheduled for resolution. We might require an API that can return Thunks or something similar to fully support this feature.
I don't have enough knowledge of the internal workings to comment on the implementation, but thank you for the contribution to make this library perform better in production, and for bringing your experience with libraries from other languages. I am sure the core maintainers will respond soon, but I just wanted to express my excitement about your contribution to improve this library. Thank you!
@tux21b - I'm curious if it would be possible for me to contribute to this. My company recently ran into a problem I've flagged here: https://github.com/graph-gophers/graphql-go/issues/375
I think this solution is best for both addressing the runaway goroutines and providing breadth-first execution. Let me know how I can help (is it just subscriptions?).
@pavelnikolov, just tagging you here because it looks like you're one of the maintainers? If you are, I'm curious what you think about this breadth-first execution. Would you or any other maintainer be opposed if this or any other PR was proposed to do this?
Hi @shannontan-bolt I am willing to consider this if there is a benchmark that clearly demonstrates the benefits of this. I am happy to consider/merge any improvement to this library that would be beneficial to the users and wouldn't sacrifice simplicity.
@pavelnikolov FYI, I didn't write this PR, but I don't know if the author intends on pushing it through. I was considering tackling it because our system has run into this:
https://github.com/graph-gophers/graphql-go/issues/375
If the author doesn't respond, I'm going to try to take a crack at it sometime over the next month (we've just imposed really strict query size limits in our codebase).
I am definitely going to consider breadth-first execution, but I would like benchmarks that provide a before/after comparison and clearly show the benefit. Also, I expect that this wouldn't introduce any breaking changes to the public API.
Closing this PR due to inactivity and since #400 takes care of it.
@pavelnikolov I don't see how the linked PR takes care of the execution order though, right? If I interpret it correctly, then this library is still doing depth-first execution, which still requires any form of batch loading to rely on a very imprecise time-based heuristic.