concurrent-ruby
concurrent-ruby copied to clipboard
Bad test behavior when transient test threads die
I have been seeing a failure on and off when running concurrent-ruby's specs against JRuby head. I believe this failure in the cylic barrier specs may have a bug:
1) Concurrent::CyclicBarrier#number_waiting with waiting threads should be equal to the waiting threads count
Failure/Error: expect(thread_join).not_to be_nil, thread.inspect
#<Thread:0x15605d83@/home/travis/build/jruby/jruby/concurrent-ruby/spec/support/example_group_extensions.rb:35 aborting>
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-support-3.9.2/lib/rspec/support.rb:97:in `block in Support'
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-support-3.9.2/lib/rspec/support.rb:106:in `notify_failure'
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/fail_with.rb:35:in `fail_with'
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/handler.rb:40:in `handle_failure'
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/handler.rb:72:in `block in handle_matcher'
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/handler.rb:27:in `with_matcher'
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/handler.rb:70:in `handle_matcher'
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/expectation_target.rb:78:in `not_to'
# ./spec/spec_helper.rb:62:in `block in /home/travis/build/jruby/jruby/concurrent-ruby/spec/spec_helper.rb'
# org/jruby/RubyBasicObject.java:2694:in `instance_exec'
# /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-core-3.9.1/lib/rspec/core/example.rb:450:in `instance_exec'
The failure above is confusing because it's reporting one description and failing on some other line. Specifically, the failure error says it is waiting for a thread to join (which is the join_with logic used around line 200 in cyclic_barrier_spec.rb), but the spec description reports a "waiting threads count" spec has failed.
Digging into both specs, I think the in_thread logic in the spec helpers may be flawed. Here's the waiting threads spec:
context 'with waiting threads' do
it 'should be equal to the waiting threads count' do
in_thread { barrier.wait }
in_thread { barrier.wait }
repeat_until_success { expect(barrier.number_waiting).to eq 2 }
end
end
It starts up two threads to wait on the barrier and then checks that they eventually show up as waiting. I believe this spec passes fine, but those threads are little time bombs due to this code in in_thread:
def in_thread(*arguments, &block)
@created_threads ||= Queue.new
new_thread = Thread.new(*arguments) do |*args, &b|
Thread.abort_on_exception = true
block.call(*args, &b)
end
@created_threads.push new_thread
new_thread
end
This logic creates a new thread and adds it to a queue to be shut down later. But strangely, this code also sets the global abort_on_exception flag to true.
This combine with the shutdown code is likely leading to unexpected results:
config.after :each do
while defined?(@created_threads) && @created_threads && (thread = (@created_threads.pop(true) rescue nil))
thread.kill
thread_join = thread.join(0.25)
if thread_join.nil?
require 'jruby'
puts JRuby.ref(thread).getNativeThread.getStackTrace.to_a
end
expect(thread_join).not_to be_nil, thread.inspect
end
end
My theory is that the failure above is due to one of these abandoned threads being lazily killed. The call to thread.kill causes the thread's barrier wait to be interrupted, raising an error and eventually terminating the thread. Because of abort_on_exception, whatever the main thread is running at that point (like a thread.join in join_with) will be interrupted, and that might happen while running other specs, as is the case here.
I don't think we should be using abort_on_exception at all. If these threads are expected to run to completion, we should be testing for that. We should not allow a spec in one thread to abort the entire spec run because it happened to bubble out an error... especially when these threads are transient and being forcibly killed.
The trivial patch here would be to remove the abort_on_exception call:
diff --git a/spec/support/example_group_extensions.rb b/spec/support/example_group_extensions.rb
index 8c165feb..d64e924c 100644
--- a/spec/support/example_group_extensions.rb
+++ b/spec/support/example_group_extensions.rb
@@ -33,7 +33,6 @@ module Concurrent
def in_thread(*arguments, &block)
@created_threads ||= Queue.new
new_thread = Thread.new(*arguments) do |*args, &b|
- Thread.abort_on_exception = true
block.call(*args, &b)
end
@created_threads.push new_thread
Alternatively, we could keep the abort logic if we gracefully shut down all of these transient threads, rather than forcibly killing them.
I will be disabling the concurrent-ruby suite for JRuby's CI until we can resolve this.