concurrent-ruby Bad test behavior when transient test threads die

Bad test behavior when transient test threads die

Open headius opened this issue 5 years ago • 0 comments

I have been seeing a failure on and off when running concurrent-ruby's specs against JRuby head. I believe this failure in the cylic barrier specs may have a bug:

  1) Concurrent::CyclicBarrier#number_waiting with waiting threads should be equal to the waiting threads count
     Failure/Error: expect(thread_join).not_to be_nil, thread.inspect
       #<Thread:0x15605d83@/home/travis/build/jruby/jruby/concurrent-ruby/spec/support/example_group_extensions.rb:35 aborting>
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-support-3.9.2/lib/rspec/support.rb:97:in `block in Support'
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-support-3.9.2/lib/rspec/support.rb:106:in `notify_failure'
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/fail_with.rb:35:in `fail_with'
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/handler.rb:40:in `handle_failure'
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/handler.rb:72:in `block in handle_matcher'
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/handler.rb:27:in `with_matcher'
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/handler.rb:70:in `handle_matcher'
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.9.1/lib/rspec/expectations/expectation_target.rb:78:in `not_to'
     # ./spec/spec_helper.rb:62:in `block in /home/travis/build/jruby/jruby/concurrent-ruby/spec/spec_helper.rb'
     # org/jruby/RubyBasicObject.java:2694:in `instance_exec'
     # /home/travis/build/jruby/jruby/lib/ruby/gems/shared/gems/rspec-core-3.9.1/lib/rspec/core/example.rb:450:in `instance_exec'

The failure above is confusing because it's reporting one description and failing on some other line. Specifically, the failure error says it is waiting for a thread to join (which is the join_with logic used around line 200 in cyclic_barrier_spec.rb), but the spec description reports a "waiting threads count" spec has failed.

Digging into both specs, I think the in_thread logic in the spec helpers may be flawed. Here's the waiting threads spec:

      context 'with waiting threads' do
        it 'should be equal to the waiting threads count' do
          in_thread { barrier.wait }
          in_thread { barrier.wait }
          repeat_until_success { expect(barrier.number_waiting).to eq 2 }
        end
      end

It starts up two threads to wait on the barrier and then checks that they eventually show up as waiting. I believe this spec passes fine, but those threads are little time bombs due to this code in in_thread:

    def in_thread(*arguments, &block)
      @created_threads ||= Queue.new
      new_thread       = Thread.new(*arguments) do |*args, &b|
        Thread.abort_on_exception = true
        block.call(*args, &b)
      end
      @created_threads.push new_thread
      new_thread
    end

This logic creates a new thread and adds it to a queue to be shut down later. But strangely, this code also sets the global abort_on_exception flag to true.

This combine with the shutdown code is likely leading to unexpected results:

  config.after :each do
    while defined?(@created_threads) && @created_threads && (thread = (@created_threads.pop(true) rescue nil))
      thread.kill
      thread_join = thread.join(0.25)
      if thread_join.nil?
        require 'jruby'
        puts JRuby.ref(thread).getNativeThread.getStackTrace.to_a
      end
      expect(thread_join).not_to be_nil, thread.inspect
    end
  end

My theory is that the failure above is due to one of these abandoned threads being lazily killed. The call to thread.kill causes the thread's barrier wait to be interrupted, raising an error and eventually terminating the thread. Because of abort_on_exception, whatever the main thread is running at that point (like a thread.join in join_with) will be interrupted, and that might happen while running other specs, as is the case here.

I don't think we should be using abort_on_exception at all. If these threads are expected to run to completion, we should be testing for that. We should not allow a spec in one thread to abort the entire spec run because it happened to bubble out an error... especially when these threads are transient and being forcibly killed.

The trivial patch here would be to remove the abort_on_exception call:

diff --git a/spec/support/example_group_extensions.rb b/spec/support/example_group_extensions.rb
index 8c165feb..d64e924c 100644
--- a/spec/support/example_group_extensions.rb
+++ b/spec/support/example_group_extensions.rb
@@ -33,7 +33,6 @@ module Concurrent
     def in_thread(*arguments, &block)
       @created_threads ||= Queue.new
       new_thread       = Thread.new(*arguments) do |*args, &b|
-        Thread.abort_on_exception = true
         block.call(*args, &b)
       end
       @created_threads.push new_thread

Alternatively, we could keep the abort logic if we gracefully shut down all of these transient threads, rather than forcibly killing them.

I will be disabling the concurrent-ruby suite for JRuby's CI until we can resolve this.

Apr 08 '20 04:04 headius

concurrent-ruby concurrent-ruby copied to clipboard

Bad test behavior when transient test threads die

concurrent-ruby
concurrent-ruby copied to clipboard