graphql-ruby
graphql-ruby copied to clipboard
AsyncDataloader freezes
Describe the bug
When using AsyncDataloader instead of the normal one, operations that involve a Source will freeze and the HTTP request never completes.
Versions
graphql version: 2.2.4
rails (or other framework): 7.1.2
graphql-pro version: 1.25.2
puma version: 6.4
async version: 2.8
GraphQL schema
This seems to happen when using any kind of dataloader source.
GraphQL query
Any query that involves using a dataloader source has this problem.
Steps to reproduce
Enable the async dataloader on a rails 7.1 project that uses dataloader sources, then try doing a query that involves reading from a source
Expected behavior
The query works correctly, as it does with the regular dataloader.
Actual behavior
The HTTP request and ActiveRecord read freeze. On PostgreSQL, the query appears stuck with wait status ClientRead when checking pg_stat_activity.
On the rails process, the HTTP request is never resolved and the ruby code is frozen in results.find { |r| r.id == id } in the fetch method of the Source.
Additional context
The problem occurs regardless of isolation level being set to fiber or not. It only occurs with AsyncDataloader and never occurs with the regular Dataloader.
I looked at other issues and followed the development of the feature and it seems it works for other people, so I am looking for some help to figure out why it does not work for us
Hi! Yes, I don't think it freezes for any operation that uses a Dataloader source, but maybe we can narrow down the problem and find out why it's not working for you. Could you please share:
- The smallest GraphQL query you can find that replicates this bug
- The Ruby source code for the fields and dataloader sources involved in it
- The Ruby stack trace right before the line of code that freezes (eg using
puts caller)
That will give a lot of information about what's going on when it freezes and maybe we can work up a replication script to figure out what's going wrong :+1:
Here's the stack trace:
about to do 'puts caller' and then call profiles.find
/workspaces/flexhire/backend/app/graphql/sources/profile_for_user.rb:5:in `map'
/workspaces/flexhire/backend/app/graphql/sources/profile_for_user.rb:5:in `fetch'
/usr/local/rvm/gems/ruby-3.2.2/gems/graphql-2.2.4/lib/graphql/dataloader/source.rb:136:in `run_pending_keys'
/usr/local/rvm/gems/ruby-3.2.2/gems/graphql-2.2.4/lib/graphql/dataloader/async_dataloader.rb:79:in `each'
/usr/local/rvm/gems/ruby-3.2.2/gems/graphql-2.2.4/lib/graphql/dataloader/async_dataloader.rb:79:in `block in spawn_source_task'
/usr/local/rvm/gems/ruby-3.2.2/gems/async-2.8.0/lib/async/task.rb:161:in `block in run'
/usr/local/rvm/gems/ruby-3.2.2/gems/async-2.8.0/lib/async/task.rb:331:in `block in schedule'
Profile Load (3.5ms) SELECT "profiles".* FROM "profiles" WHERE "profiles"."user_id" = $1 [["user_id", 4607]]
↳ app/graphql/sources/profile_for_user.rb:8:in `block in fetch'
The query:
query {
currentUser {
profile { id }
}
}
The relevant code:
# user_type.rb
class UserType
field :profile, ProfileType, description: "for members only. Associated profile data"
# ... omited irrelevant code
def profile
context.dataloader.with(Sources::ProfileForUser).load(object.id)
end
end
# profile_for_user.rb
class Sources::ProfileForUser < GraphQL::Dataloader::Source
def fetch(users_ids)
profiles = Profile.where(user_id: users_ids)
users_ids.map do |user_id|
puts "about to do 'puts caller' and then call profiles.find"
puts caller
profiles.find { |p| p.user_id == user_id }
end
end
end
The profiles.find call is the one that triggers the lazy evaluation of the SQL query in active record, so I am thinking the problem might not be in graphql-ruby. However using active record in other ways seems to work
I actually noticed it didn't freeze. It eventually finished executing
It took over 3 minutes. This HTTP request takes around 90ms normally
EDIT: also I notice that when it "freezes", one logical CPU thread of my machine is at 100% and the process causing it is the puma web server. So something weird is happening in ruby.
In the logs, active record reports the queries as taking only a few ms
Interesting -- thanks for sharing all those details. My first thought was, maybe it was stuck in the while loop here:
https://github.com/rmosolgo/graphql-ruby/blob/d93e2fbab1747d46ec1089bab56e72875fd02c89/lib/graphql/dataloader/async_dataloader.rb#L38-L43
But, I see that there's a root_task.yield which is meant to address this exact situation: tell Async's event loop to hand off to another task (to see if it's finished).
Unfortunately my suggestion for puts caller was worthless; I forgot that it'd be in a brand new caller and therefore have nothing useful in the backtrace 😖
I'm going to try to replicate this locally. In the meantime, I'm wondering if you could try a few experiments to help us isolate exactly where it's going wrong:
- Remove the
dataloader.with(...).load(...)call, and instead, callProfile.finddirectly. Does that work as expected? (If so, then we can be sure it has to do with the AsyncDataloader -> Source flow.) - When using
dataloader.with(...).load(...), what if you changedef fetchto callusers_ids.map { |id| Profile.find(id) }directly? (Obviously that's bad ActiveRecord, but I'm curious if it behaves differently with a direct query.) - Also, could you try a
rbspy snapshotwhile the query is running? https://rbspy.github.io/using-rbspy/snapshot.html That would let us sneak a peek a the top-level backtrace while it's doing ... whatever it's doing!
Let me know what you find -- I'll try to replicate on my end in the meantime!
I wrote a replication script, but it worked fine:
Querying for currentUser.profile.id with `AsyncDataloader`
require 'bundler/inline'
gemfile do
source "https://rubygems.org"
gem "graphql", "2.2.4"
gem "rails", "7.1.2"
gem "postgresql"
gem "async", "~>2.8.0"
end
require "active_record"
ActiveSupport::IsolatedExecutionState.isolation_level = :fiber
ActiveRecord::Base.establish_connection("postgres://postgres:@localhost/postgres")
ActiveRecord::Schema.define do
self.verbose = false
create_table :profiles, force: true do |t|
t.integer :user_id
end
end
class Profile < ActiveRecord::Base
end
Profile.create!(user_id: 5)
class MySchema < GraphQL::Schema
class ProfileSource < GraphQL::Dataloader::Source
def fetch(users_ids)
profiles = ::Profile.where(user_id: users_ids)
users_ids.map do |uid|
profiles.find { |pr| pr.user_id == uid }
end
end
end
class Profile < GraphQL::Schema::Object
field :id, ID
end
class User < GraphQL::Schema::Object
field :profile, Profile
def profile
dataloader.with(ProfileSource).load(object.id)
end
end
class Query < GraphQL::Schema::Object
field :current_user, User
end
query(Query)
use GraphQL::Dataloader::AsyncDataloader
end
query_str = <<-GRAPHQL
{
currentUser {
profile {
id
}
}
}
GRAPHQL
data = { current_user: OpenStruct.new(id: 5) }
pp MySchema.execute(query_str, root_value: data).to_h
# {"data"=>{"currentUser"=>{"profile"=>{"id"=>"1"}}}}
So... we have to find some more "moving parts" to identify what's going wrong here!
@rmosolgo I ran those tests
Remove the dataloader.with(...).load(...) call, and instead, call Profile.find directly - Works
When using dataloader.with(...).load(...), what if you change def fetch to call users_ids.map { |id| Profile.find(id) } directly? FAIL
For this one I used this implementation:
class Sources::ProfileForUser < GraphQL::Dataloader::Source
def fetch(users_ids)
users_ids.map do |user_id|
Profile.find_by(user_id:)
end
end
end
And it doesn't work, it gets stuck on line 4: Profile.find_by(user_id:) - seems that active record operations will hang inside a dataloader source
here is a snapshot while the query is running:
<main> - /workspaces/flexhire/backend/bin/rails:4
require - /usr/local/rvm/gems/ruby-3.2.2/gems/bootsnap-1.12.0/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:34
require [c function] - (unknown)
<main> - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/commands.rb:18
invoke - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command.rb:83
with_argv - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command.rb:152
block in invoke - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command.rb:75
perform - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command/base.rb:74
dispatch - /usr/local/rvm/gems/ruby-3.2.2/gems/thor-1.3.0/lib/thor.rb:528
invoke_command - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/command/base.rb:181
invoke_command - /usr/local/rvm/gems/ruby-3.2.2/gems/thor-1.3.0/lib/thor/invocation.rb:129
run - /usr/local/rvm/gems/ruby-3.2.2/gems/thor-1.3.0/lib/thor/command.rb:38
perform - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/commands/server/server_command.rb:150
tap - unknown:92
block in perform - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/commands/server/server_command.rb:149
start - /usr/local/rvm/gems/ruby-3.2.2/gems/railties-7.1.2/lib/rails/commands/server/server_command.rb:41
start - /usr/local/rvm/gems/ruby-3.2.2/gems/rack-2.2.8/lib/rack/server.rb:328
run - /usr/local/rvm/gems/ruby-3.2.2/gems/puma-6.4.0/lib/rack/handler/puma.rb:82
run - /usr/local/rvm/gems/ruby-3.2.2/gems/puma-6.4.0/lib/puma/launcher.rb:197
run - /usr/local/rvm/gems/ruby-3.2.2/gems/puma-6.4.0/lib/puma/single.rb:67
join [c function] - (unknown)
I am not getting any insight from this snapshot, however to be honest the ruby ecosystem never really clicked with me so I guess you will need to be patient :smile:
Ugh, bummer -- the snapshot just showed the top-level thread waiting for the Fiber to finish. Not helpful 😖
I'm not exactly sure where to look next. I'd say the Ruby ecosystem did click with me, but using Fibers this way is pretty new!
I'm going to think about more debugging options and follow up here.
In the meantime, I documented some other approaches for adding parallelism to GraphQL::Dataloader here: https://graphql-ruby.org/dataloader/parallelism.html. If try any of those out, please let me know how it goes!
👋 It's been a while, but since then, I did discover another issue with AsyncDataloader + Rails (#4921) . In Rails < 7.2, Connections aren't automatically released when their Fiber terminates. As a result, Rails can freeze where its connection pool is full but connections aren't actually released.
To solve this, you can hook into Dataloader's fiber lifecycle with the code here: https://graphql-ruby.org/dataloader/async_dataloader.html#activerecord-connections. That lets you release those connections manually.
@fazo96, do you want to give your app a try with that code? (Or Rails 7.2?)