truffleruby
truffleruby copied to clipboard
Autoloader racing on multiple threads
Autoloader use with multiple threads cause premature use of the class.
Example
When executing mutliple Rails
tests with the debugger: jt --use jvm-ce ruby --jdebug -S ./bin/rails test test/folder
the execution fails with:
#<Thread:0x390d8 @active_support_execution_state={:active_support_execution_context=>{}} /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/minitest-5.18.0/lib/minitest/parallel.rb:28 run> terminated with exception (report_on_exception is true):
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/railtie.rb:54:in `block (2 levels) in <class:Railtie>': undefined method `reset_all' for ActiveSupport::CurrentAttributes:Class (NoMethodError)
Did you mean? reset_callbacks
from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:445:in `instance_exec'
from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:445:in `block in make_lambda'
from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:199:in `block (2 levels) in halting'
from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:687:in `block (2 levels) in default_terminator'
from <internal:core> core/throw_catch.rb:36:in `catch'
from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:686:in `block in default_terminator'
The test executor spawns threads to spread the test execution, and most threads fail with this error (not all). What's happening is the very first thread is referring to ActiveSupport::CurrentAttributes
, an unknown constant. The autoload definition in https://github.com/rails/rails/blob/2bfb6565a4996a19065e03300d9061e9eb265d92/activesupport/lib/active_support.rb#L42 delegates the resolution to Zeitwerk and it starts loading the class (snippet from the first thread load backtrace):
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/current_attributes.rb:4:in `/'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/current_attributes.rb:4:in `<top (required)>'
<internal:core> core/kernel.rb:234:in `gem_original_require'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/zeitwerk-2.6.8/lib/zeitwerk/kernel.rb:38:in `require'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/railtie.rb:49:in `block (2 levels) in <class:Railtie>'
Meanwhile the other threads will just use the class as if it's already loaded - which is not the case, the class's Ruby code is still executing (function definitions) so some threads won't have all the methods available: hence the error. This was confirmed by logging file execution start/end.
Reproduce
- make a new rails project with truffle ruby
- add a bunch of scaffolds, eg:
rails generate scaffold PostA name:string title:string content:text value:integer
- call tests with
--jdebug
Minimum example
Unfortunately reproducing it with a minimum example was tricky. It seems using Module#autolad
vs ActiveSupport::Autoload#autoload
was the same. I couldn't replicate it with --jvm-ce
but saw something similar with --native
.
Code
# Bait file.
class Foo
sleep(0.1)
def bar; end
end
require("thread")
autoload("Foo", "./foo.rb")
thread = [nil] * 8
thread[0] = Thread.new { Foo.new.bar }
thread[1] = Thread.new { Foo.new.bar }
thread[2] = Thread.new { Foo.new.bar }
thread[3] = Thread.new { Foo.new.bar }
thread[4] = Thread.new { Foo.new.bar }
thread[5] = Thread.new { Foo.new.bar }
thread[6] = Thread.new { Foo.new.bar }
thread[7] = Thread.new { Foo.new.bar }
thread.each(&:join)
Example output (6 failure, 2 success):
#<Thread:0x148 try_break_autoload.rb:26 run> terminated with exception (report_on_exception is true):
try_break_autoload.rb:26:in `block in <main>': undefined method `bar' for #<Foo:0x158> (NoMethodError)
#<Thread:0x138 try_break_autoload.rb:25 run> terminated with exception (report_on_exception is true):
try_break_autoload.rb:25:in `block in <main>': undefined method `bar' for #<Foo:0x168> (NoMethodError)
Observation
- Rails on CRuby does not run tests on multiple threads
- CRuby and Truffle with
--jvm-ce
does not reproduce the error with the snippet above - I'm using
--jdebug
on the Rails test run because without it there is a different racing error (tracked separately), probably the debugger slows it down enough so the first error does not occur that often - this error shows with/without compilation (
--engine.Compilation=false
)
I think you are seeing https://github.com/oracle/truffleruby/issues/2431. That issue has more details on what's currently done and what's missing. The problem is we do synchronize so only one thread autoloads, but we also publish the constant as soon as it's set which means other threads can see it at that point, before the file is executed. I guess when the autoload constant is set we should consider it set only for the current thread doing the file loading, and then publish it when the current thread finishes loading that file.
Thanks for the detailed issue and reproduction.
Didn't get too far, still got confused by the nodes layout in execution. I've used the following code for debugging (safe enough to avoid racing but still threaded so the same-thread-locking guards can be tested):
require("thread")
autoload("Foo", "./foo.rb")
Foo.new.bar
Thread.new { Foo.new.bar }.join
At the first Foo.new.bar
on the main thread it loads LookupConstantWithLexicalScopeNode
which calls GetConstantNode
. This is where Object
has the constant field for Foo
with autoload config. Here it starts the require: https://github.com/oracle/truffleruby/blob/7da015fbb94d880d19e1aca03d0343f7c1d570c9/src/main/java/org/truffleruby/language/constants/GetConstantNode.java#L124 This goes down the stack and calls GetConstantNode
again (couldn't figure out why it needs a second one). This eventually replaces the constant entry for Object
: https://github.com/oracle/truffleruby/blob/c1702b035af3f35b6080e3179d456407865c4291/src/main/java/org/truffleruby/core/module/ModuleFields.java#L487. And now this can return to the parent GetConstantNode
and resolve the autoload lock: https://github.com/oracle/truffleruby/blob/7da015fbb94d880d19e1aca03d0343f7c1d570c9/src/main/java/org/truffleruby/language/constants/GetConstantNode.java#L129
Now when we go to the second lookup for Thread.new { Foo.new.bar }.join
the LookupConstantWithLexicalScopeNode
returns the new constant declaration for Foo
which will just use the final Foo
: https://github.com/oracle/truffleruby/blob/7da015fbb94d880d19e1aca03d0343f7c1d570c9/src/main/java/org/truffleruby/language/constants/GetConstantNode.java#L74
Only had vague ideas how to approach it:
- For final constants, when replacing the autoloaded ones for a module, a property could be set indicating that it's not completed (loaded). Once
GetConstantNode
is completed and locks are resolved, this property could be switched (to completed=true). One small challenge is that the autoloader inGetConstantNode
would need to look this up to update it. And a bigger challenge is what to do with the other const loader that waits for the in-progress original load to finish. Busy sleep? Semaphore?
I'm looking for some guidance on a better approach here.
One idea is for the constant being autoloaded to not set it yet but either store it in a new field of AutoloadConstant or maybe in the RubyConstant#value of the autoload RubyConstant. And then we can probably reuse the existing autoload locks, so as long as the other thread sees it's not set yet it will try to acquire the autoload lock and wait, so that's fine. And we have AutoloadConstant#isAutoloadingThread() to know which thread is allowed to see the value.
A reliable reproducer:
# main.rb
autoload :C, "#{__dir__}/c.rb"
$inside = false
t = Thread.new { p C.new.foo }
Thread.pass until $inside
p C.new.foo
t.join
# c.rb
class C
def bar
end
$inside = true
sleep 1
def foo
42
end
end
I tried an edge case, what if while loading the file we create another thread which tries to access the being-autoloaded constant? Answer: it deadlocks on CRuby:
# c.rb
class C
def bar
end
Thread.new { p C }.join
$inside = true
sleep 1
def foo
42
end
end
So it's simple, we don't need to handle that case, it can just hang.
If we simplify main.rb we can get CRuby to print the deadlock:
# main.rb
autoload :C, "#{__dir__}/c.rb"
p C.new.foo
gives
/home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5:in `join': No live threads left. Deadlock? (fatal)
2 threads, 2 sleeps current:0x0000000001bdfae0 main thread:0x0000000001b03060
* #<Thread:0x00007f26bfe1fcc0 sleep_forever>
rb_thread_t:0x0000000001b03060 native:0x00007f26bff40740 int:0
* #<Thread:0x00007f26bfdd8eb0 /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5 sleep_forever>
rb_thread_t:0x0000000001bdfae0 native:0x00007f26af30b6c0 int:0
depended by: tb_thread_id:0x0000000001b03060
from /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5:in `<class:C>'
from /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:1:in `<top (required)>'
from <internal:/home/eregon/.rubies/ruby-3.1.3/lib/ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
from <internal:/home/eregon/.rubies/ruby-3.1.3/lib/ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
from main.rb:2:in `<main>'
With Fibers on 3.2, with Fiber.new { p C }.resume
instead of Thread.new { p C }.join
it's:
c.rb:5:in `block in <class:C>': deadlock; lock already owned by another fiber belonging to the same thread (ThreadError)
So just an early error but still a deadlock. On 3.1 it seems to use a per-Ruby-Thread lock and it "works", but let's use 3.2 semantics it's far easier for TruffleRuby.
I have a fix for this in https://github.com/oracle/truffleruby/pull/3078, I'm still polishing it a bit.
Notably there is the case of module M; autoload :OpenSSL, "openssl"; p OpenSSL; end
I need to review.
I also noticed we still have the notion of "undefined constants", but this should have been removed in CRuby 3.1, so now we should be able to clean that up as well. But probably best done separately, it's already quite a complex change as-is.