Use a queue for execution
I'm feeling really inspired by @gmac's proof-of-concept over at https://github.com/gmac/graphql-cardinal/, so I thought I'd try again at refactoring execution to use a queue instead of recursive method calls. I have tried before (#4967, #4968, somewhat #3998, #4935) and given up along the way, but I'm going to try again 😅
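To show the shape of the idea, here's a minimal, self-contained sketch of queue-based execution (the `FieldStep` struct and `run_queue` helper are hypothetical illustrations, not graphql-ruby's actual classes): each unit of work becomes an object on a flat queue, and resolving a field enqueues steps for its child selections instead of recursing.

```ruby
# Hypothetical step object: resolves one field, and enqueues follow-up
# steps for child selections rather than calling itself recursively.
FieldStep = Struct.new(:object, :field, :result, :key) do
  def run(queue)
    value = object.public_send(field[:method])
    if (selections = field[:selections])
      child_result = {}
      result[key] = child_result
      selections.each do |name, child_field|
        queue << FieldStep.new(value, child_field, child_result, name)
      end
    else
      result[key] = value
    end
  end
end

# The runner is a flat loop instead of nested method calls, so deeply
# nested queries don't grow the Ruby call stack.
def run_queue(root_object, selections)
  result = {}
  queue = selections.map { |name, field| FieldStep.new(root_object, field, result, name) }
  queue.shift.run(queue) until queue.empty?
  result
end

Post = Struct.new(:title)
Author = Struct.new(:name, :post)
author = Author.new("gmac", Post.new("Queues"))
data = run_queue(author, {
  "name" => { method: :name },
  "post" => { method: :post, selections: { "title" => { method: :title } } },
})
# data == { "name" => "gmac", "post" => { "title" => "Queues" } }
```

A loop like this is also a natural place to pause for Lazy or Dataloader work: a step that can't finish yet just re-enqueues itself instead of blocking a call stack.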
TODO:
- [x] Extract some refactors from this branch
- [x] Make Lazy and Dataloader work
- [x] Consider steps for `continue_value`, `resolve_list_item`
- [x] Use steps for `call_method_on_directives`
- [x] Merge `ResolveTypeStep` into `ResultHash`
- [x] Merge directive checks into steps too?
- [x] Merge `authorized_new` lazy usage into `ResultHash`
- Look for other `after_lazy` usages and rebuild them using queue steps
- [x] Refactor `ResultArray` and `ResultHash` into state machines
- [ ] Merge ResultHash into Schema::Object
- [x] Fix dataloaded arguments
- [x] RunQueue is shared by the multiplex -- can it be merged into one of the other objects with the same lifecycle? (`Multiplex`, `Interpreter`, `Dataloader`)
- [x] Tighter integration between Dataloader, Lazy, and Queue?
- [ ] Merge Field extensions into FieldResolveStep to reduce overhead when used?
- Tests that are still failing:
- [ ] current_field, current_arguments in context
- [ ] rescuing authorized errors
- [ ] batching lazies correctly
- [ ] supporting parallel I/O in dataloader
- ...
There's still a lot to do here for compatibility and performance, but the initial benchmark results are surprisingly good (even with the terrible FieldResolutionStep implementation which is hogging memory). This is using a modified version of the benchmark from graphql-cardinal:
```diff
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-darwin22]
Calculating -------------------------------------
graphql-ruby: 142 resolvers
-  399.549 (± 8.5%) i/s  (2.50 ms/i) - 2.000k in 5.048521s
+  538.453 (± 3.9%) i/s  (1.86 ms/i) - 2.695k in 5.013324s
graphql-ruby: 140002 resolvers
-    0.517 (± 0.0%) i/s   (1.93 s/i) - 3.000 in 5.797333s
+    0.761 (± 0.0%) i/s   (1.31 s/i) - 4.000 in 5.272811s
- Total allocated: 10149304 bytes (70353 objects)
+ Total allocated: 13431224 bytes (97389 objects)
```
So that's a 34% or 47% speedup. It's using 30% more memory, because it's currently the Simplest Thing That Could Possibly Work. Presumably refactoring it to avoid those repeated allocations will make it even faster!
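For the record, those percentages fall straight out of the numbers in the output above:

```ruby
# Speedup = new i/s ÷ old i/s − 1; memory growth computed the same way
# from the allocated-bytes totals.
small_speedup = 538.453 / 399.549 - 1     # ≈ 0.35 (the ~34% figure, 142 resolvers)
large_speedup = 0.761 / 0.517 - 1         # ≈ 0.47 (the 47% figure, 140002 resolvers)
memory_growth = 13431224.0 / 10149304 - 1 # ≈ 0.32 (the ~30% extra memory)
```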
I haven't pushed a commit here for a while but I'm still planning on landing this somehow. Here's a quick brain-dump in case anyone is curious:
- The perf win is not as big as it was initially because I added objects (`FieldResolveStep`) for each field resolution. Amazingly, it's still faster than the previous implementation, but it's 20%-30% IIRC.
- It might get slower, because the shift to this kind of flow will also require queue objects for argument resolution (basically anywhere that can return a Promise).
- The time tradeoff comes at the expense of memory, which I think is worth it. I have some ideas for reducing memory once this is really working:
  - Merge `GraphQLResultHash` into `GraphQL::Schema::Object` -- they have the same lifecycle at runtime, so they could be merged.
  - Merge `Field::ExtendedState` into `FieldResolveStep`; this will save an allocation in possibly-tight loops when field extensions are added to repeatedly-used fields.
  - Introduce "optimized" field execution. Sometimes we can detect when a field could use a much simpler execution flow, e.g. "just call a method and insert the result into the hash", without preparing arguments, running extensions, etc. This could cover a lot of basic fields and speed things up, but we still have to be ready for Promises and Dataloader pauses.
- Merging and releasing this will be tricky. I expect to have full compatibility when I'm done (or very-nearly-full 😅 ) but it's such a huge change that I want to make sure it's a gentle release so nobody gets nasty surprises. Some thoughts:
- Many of the changes in this branch could be isolated and released separately: merging Dataloader and Lazy resolution; merging ResultHash and Schema::Object; moving execution code into named methods on Step classes.
- After that, you could switch from stack-based to queue-based using code that was mostly in place. Ideally both implementations would be available (and runtime-swappable) but I think the implementation will have to reveal itself once the other changes have been merged.
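To make the "optimized" field execution idea above concrete, here's a hypothetical sketch (none of these names are graphql-ruby's actual API): detect fields with no arguments or extensions, call the method directly, and only fall back to the full step pipeline when the value needs deferral.

```ruby
# Stand-in for anything that must pause execution (a Promise or a Dataloader source).
class Deferred; end

# Hypothetical fast path: skip argument preparation and extensions when the
# field doesn't use them, but stay ready to hand work back to the full queue.
def resolve_field(object, field, result, queue)
  if field[:arguments].to_a.empty? && field[:extensions].to_a.empty?
    value = object.public_send(field[:method])
    if value.is_a?(Deferred)
      queue << [:full_resolve, object, field, result] # re-enter the slow path
    else
      result[field[:name]] = value # "just call a method and insert the result"
    end
  else
    queue << [:full_resolve, object, field, result]
  end
end

Article = Struct.new(:title)
result = {}
queue = []
resolve_field(Article.new("Queues"), { name: "title", method: :title }, result, queue)
# result == { "title" => "Queues" } and the queue stays empty
```

The point of the `Deferred` check is the caveat from above: even a "simple" field can hand back a Promise or hit a Dataloader pause, so the fast path has to be able to bail out into the regular queue.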