Dataloader fiber memory allocation errors leave fibers running after transaction completes
Describe the bug
I recently upgraded our application to the latest version of graphql-ruby and converted all of our uses of graphql-batch to graphql-ruby's Dataloader framework. After deploying to production, we've been experiencing this error occasionally:
FiberError (can't set a guard page: Cannot allocate memory)
I was able to reproduce it with a large enough graphql query and added some additional logging:
Oct 17 15:21:23 ip-... app-web[1]: I, [2024-10-17T15:21:23.838743 #18363] INFO -- : Fiber stats: {"Fiber"=>6}
...
Oct 17 15:21:49 ip-... app-web[1]: F, [2024-10-17T15:21:49.679186 #18363] FATAL -- : FiberError (can't set a guard page: Cannot allocate memory):
Oct 17 15:21:49 ip-... app-web[1]: F, [2024-10-17T15:21:49.679232 #18363] FATAL -- :
Oct 17 15:21:49 ip-... app-web[1]: F, [2024-10-17T15:21:49.679327 #18363] FATAL -- : .../gems/graphql-2.3.16/lib/graphql/dataloader.rb:231:in `resume'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/graphql-2.3.16/lib/graphql/dataloader.rb:231:in `run_fiber'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/graphql-2.3.16/lib/graphql/dataloader.rb:192:in `block in run'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/graphql-2.3.16/lib/graphql/dataloader.rb:238:in `block in spawn_fiber'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/newrelic_rpm-9.13.0/lib/new_relic/agent/tracer.rb:434:in `block (2 levels) in thread_block_with_current_transaction'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/newrelic_rpm-9.13.0/lib/new_relic/agent/tracer.rb:357:in `capture_segment_error'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/newrelic_rpm-9.13.0/lib/new_relic/agent/tracer.rb:433:in `block in thread_block_with_current_transaction'
I also had a thread going to track how many fibers were running every 5s:
Thread.new do
  while true
    sleep 5
    stats = {}
    stats["Fiber"] = 0
    ObjectSpace.each_object(Fiber) { |o| stats["Fiber"] += 1 if o.alive? }
    Rails.logger.info "Fiber stats: #{stats.inspect}"
  end
end
I noticed that when we encounter this fiber error (which we need to fix), it also leaves all of the pending fibers still running:
Oct 17 15:26:29 ip-... app-web[1]: I, [2024-10-17T15:26:29.917993 #18363] INFO -- : Fiber stats: {"Fiber"=>31749}
The above log message is ~5 minutes after the previous transaction failed with a 500 error code.
The impact is that any future request on that worker fails immediately: the number of running fibers is already at the limit, so any attempted use of Dataloader raises the same error. We end up with a Unicorn worker that is still running but effectively dead -- it just keeps returning 500s because of the exception.
It seems like there are 2 issues going on here:
- We're hitting the default vm.max_map_count max (65k), which I suppose we'll need to increase or find a way to stay under (see the sketch below)
- We're seeing fibers not cleaning up after themselves once the transaction fails, leaving Unicorn workers unusable
This issue is to look at (2).
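As an aside on (1): the OS-level knob is vm.max_map_count (raised via sysctl), and recent graphql-ruby docs describe a fiber_limit: option on GraphQL::Dataloader that caps how many fibers a single run will create. I haven't verified that option against 2.3.16, so treat the keyword below as an assumption and MySchema as a placeholder, not our actual schema:

class MySchema < GraphQL::Schema
  # Hypothetical: caps dataloader fiber creation per run, assuming the
  # installed graphql-ruby version supports the fiber_limit: keyword.
  use GraphQL::Dataloader, fiber_limit: 100
end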
Versions
graphql version: 2.3.16
rails (or other framework): 5.2.8.1
other applicable versions (graphql-batch, etc): newrelic_rpm 9.13.0 (see backtrace)
GraphQL schema
Not sure it's relevant. We have several custom Dataloader sources, many of which can get triggered many times over within a single query.
We also have NewRelic enabled.
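For what it's worth, our sources follow the standard GraphQL::Dataloader::Source shape -- a rough sketch with made-up class and model names, not our actual code:

class RecordByIdSource < GraphQL::Dataloader::Source
  def initialize(model_class)
    @model_class = model_class
  end

  # Dataloader batches up the requested ids and calls fetch once per batch;
  # the results must line up positionally with the requested keys.
  def fetch(ids)
    records = @model_class.where(id: ids).index_by(&:id)
    ids.map { |id| records[id] }
  end
end

Field resolvers call into these roughly like: dataloader.with(RecordByIdSource, Owner).load(object.owner_id).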
GraphQL query
N/A for the moment
Steps to reproduce
- Generate a query that causes enough fibers to be created through Dataloader to trigger the above error
- Observe the number of fibers that remain after the transaction has completed (see the sketch below)
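Here is a rough sketch of how I've been observing step 2 outside of Unicorn; MySchema and big_query are placeholders, and the rescue is only there to keep the script alive after the FiberError:

def alive_fibers
  count = 0
  ObjectSpace.each_object(Fiber) { |f| count += 1 if f.alive? }
  count
end

before = alive_fibers
begin
  MySchema.execute(big_query)
rescue FiberError => e
  Rails.logger.error "Dataloader run failed: #{e.class}: #{e.message}"
end
Rails.logger.info "Alive fibers before: #{before}, after: #{alive_fibers}"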
Expected behavior
The number of alive fibers returns to its previous level once the transaction completes, even if it fails.
Actual behavior
The number of alive fibers stays at the level it reached at the point of failure, even after the transaction completes.
Additional context
None at the moment, but I'll continue to dig into this since it's impacting our production environment and I'm only rolling forward at this point.