
Dataloader fiber memory allocation errors leave fibers running after transaction completes

Open obrie opened this issue 4 months ago • 3 comments

Describe the bug

I recently upgraded our application to the latest version of graphql-ruby and converted all of our uses of graphql-batch to graphql-ruby's Dataloader framework. After deploying to production, we've been experiencing this error occasionally:

FiberError (can't set a guard page: Cannot allocate memory)

I was able to reproduce it with a large enough GraphQL query and added some extra logging:

Oct 17 15:21:23 ip-... app-web[1]: I, [2024-10-17T15:21:23.838743 #18363]  INFO -- : Fiber stats: {"Fiber"=>6}
...
Oct 17 15:21:49 ip-... app-web[1]: F, [2024-10-17T15:21:49.679186 #18363] FATAL -- : FiberError (can't set a guard page: Cannot allocate memory):
Oct 17 15:21:49 ip-... app-web[1]: F, [2024-10-17T15:21:49.679232 #18363] FATAL -- :
Oct 17 15:21:49 ip-... app-web[1]: F, [2024-10-17T15:21:49.679327 #18363] FATAL -- : .../gems/graphql-2.3.16/lib/graphql/dataloader.rb:231:in `resume'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/graphql-2.3.16/lib/graphql/dataloader.rb:231:in `run_fiber'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/graphql-2.3.16/lib/graphql/dataloader.rb:192:in `block in run'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/graphql-2.3.16/lib/graphql/dataloader.rb:238:in `block in spawn_fiber'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/newrelic_rpm-9.13.0/lib/new_relic/agent/tracer.rb:434:in `block (2 levels) in thread_block_with_current_transaction'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/newrelic_rpm-9.13.0/lib/new_relic/agent/tracer.rb:357:in `capture_segment_error'
Oct 17 15:21:49 ip-... app-web[1]: .../gems/newrelic_rpm-9.13.0/lib/new_relic/agent/tracer.rb:433:in `block in thread_block_with_current_transaction'

I also ran a background thread that logged how many fibers were alive every 5 seconds:

# Logs the number of alive Fibers in this process every 5 seconds
Thread.new do
  loop do
    sleep 5
    stats = { "Fiber" => 0 }
    ObjectSpace.each_object(Fiber) { |o| stats["Fiber"] += 1 if o.alive? }
    Rails.logger.info "Fiber stats: #{stats.inspect}"
  end
end

I noticed that when we encounter this fiber error (which we need to fix), it also leaves all of the pending fibers still running:

Oct 17 15:26:29 ip-... app-web[1]: I, [2024-10-17T15:26:29.917993 #18363]  INFO -- : Fiber stats: {"Fiber"=>31749}

The above log message is ~5 minutes after the previous transaction failed with a 500 error code.

The impact is that all subsequent requests fail immediately: the number of live fibers has already hit the limit, so any attempted use of Dataloader raises the same error. We end up with a Unicorn worker that is still running but effectively dead, returning 500 for every request.
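
As a stopgap, one option might be to recycle the worker whenever this happens. A minimal Rack middleware sketch, assuming the exception still propagates up the Rack stack and relying on Unicorn's SIGQUIT graceful-shutdown behavior (the middleware name is made up):

# Hypothetical stopgap: if a request dies with a FiberError, ask this Unicorn
# worker to shut down gracefully so the master forks a fresh one instead of
# leaving behind a worker that can only return 500s.
class RecycleWorkerOnFiberError
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue FiberError => e
    Rails.logger.fatal("FiberError, recycling worker: #{e.message}")
    Process.kill("QUIT", Process.pid) # Unicorn treats QUIT as a graceful stop
    raise
  end
end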

It seems like there are 2 issues going on here:

  1. We're hitting the default vm.max_map_count limit (65530), which I suppose we'll need to raise or find a way to stay under (a rough check is sketched after this list)
  2. We're seeing fibers not cleaning up after themselves once the transaction fails, which leaves Unicorn workers unusable (a possible cleanup workaround is sketched below)
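
For (1), a rough way to watch how close a worker is to that limit (Linux-only, reading procfs directly) might be:

# Each fiber stack and its guard page add entries to /proc/<pid>/maps, and
# vm.max_map_count caps how many mappings a single process may hold.
map_count     = File.foreach("/proc/self/maps").count
max_map_count = File.read("/proc/sys/vm/max_map_count").to_i
Rails.logger.info "memory maps: #{map_count} / #{max_map_count}"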

This issue is to look at (2).
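
Until the fibers get cleaned up by the library itself, the best workaround I can think of is to forcibly unwind whatever a failed query leaves behind. A sketch, assuming Ruby 3.3+ (for Fiber#kill) and that all of the leaked fibers were created on the request thread (the wrapper below is not graphql-ruby API):

# Snapshot the alive fibers before executing, then kill any new ones that are
# still alive afterwards so their ensure blocks run and their stacks are freed.
def execute_with_fiber_cleanup(schema, query_string, **options)
  before = ObjectSpace.each_object(Fiber).select(&:alive?)
  schema.execute(query_string, **options)
ensure
  leaked = ObjectSpace.each_object(Fiber).select(&:alive?) - before
  leaked.each do |fiber|
    fiber.kill
  rescue FiberError
    # e.g. the fiber belongs to another thread; leave it alone
  end
end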

Versions

graphql version: 2.3.16
rails (or other framework): 5.2.8.1
other applicable versions (graphql-batch, etc.)

GraphQL schema

Not sure it's relevant. We have several custom Dataloader sources, many of which can be triggered many times within a single query.

We also have NewRelic enabled.
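
For reference, our sources are ordinary GraphQL::Dataloader::Source subclasses along these lines (the class and model names below are illustrative, not our real ones):

# Illustrative source that batches record lookups by id.
module Sources
  class RecordById < GraphQL::Dataloader::Source
    def initialize(model_class)
      @model_class = model_class
    end

    # Called once per batch with every id that is pending for this source
    def fetch(ids)
      records = @model_class.where(id: ids).index_by(&:id)
      ids.map { |id| records[id] }
    end
  end
end

# Typical call site inside a resolver:
#   context.dataloader.with(Sources::RecordById, ::Customer).load(object.customer_id)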

GraphQL query

N/A for the moment

Steps to reproduce

  1. Generate a query that causes enough fibers to be created through Dataloader to trigger the above error (a hypothetical shape is sketched after these steps)
  2. Observe the number of fibers that remain after the transaction has completed
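
A hypothetical shape for step 1 (type, field, and source names are made up): a query like { orders { customer { name } } } over tens of thousands of orders resolves the customer field once per order, and each resolution that ends up waiting on the source can leave a fiber paused, so the fiber count can grow roughly with the size of the list:

# Illustrative type whose field defers onto a Dataloader source.
class Types::OrderType < Types::BaseObject
  field :customer, Types::CustomerType, null: false

  def customer
    context.dataloader.with(Sources::RecordById, ::Customer).load(object.customer_id)
  end
end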

Expected behavior

The number of alive fibers should return to its pre-query level once the transaction completes.

Actual behavior

The number of alive fibers stays where it was at the point of failure, even after the transaction completes.

Additional context

None at the moment, but I'll continue to dig into this since it's impacting our production environment and I'm only rolling forward at this point.
