workers icon indicating copy to clipboard operation
workers copied to clipboard

Race Condition with Nested Timers Causing Deadlock

Open movitto opened this issue 6 years ago • 4 comments

Thanks for the great library, I've been using it in a local project.

I've encountered an issue where a infrequent race condition is causing Threads to deadlock unexpectedly. Specifically, if a context switch occurs during the course of Scheduler#process_overdue in lib/workers/scheduler.rb, betweens the call to @pool.perform and timer.reset, and we switch to a thread that attempts to launch a timer:

  • the Scheduler's mutex will already be locked (in #start_loop which does this before invoking #process_overdue)
  • the Schedule will schedule the worker via pool.perform then the Schedule Thread loses context
  • the Timer will lock it's own mutex during the call to #fire
  • The new Timer will be launched and will block when it tries to lock the Scheduler mutex (already locked)
  • When the Scheduler Thread returns to context, it will attempt to invoke Timer#reset, which will block attempting to lock the Timer mutex resulting in a deadlock.

If there are other threads running in the pool, no Deadlock error will be raised (I believe ruby will not be able to detect this)

The critical section in question is here: https://github.com/chadrem/workers/blob/master/lib/workers/scheduler.rb#L82

We can demonstrate it with the following code:

require 'workers'

timer = Workers::Timer.new(0.01) do
  puts "T1"

  Workers::Timer.new(0.01)  do
    puts "T2"
  end
end

sleep(5)

Note that unless a context switch occurs exactly between the call to @pool.perform and timer.reset, this issue will not be apparent and the example above will execute successfully. To force this context switch simply add a 'sleep' to the #process_overdue function (not meant for production, just for testing / demonstrating this issue):

      overdue.each do |timer|
        @pool.perform(timer.callback) do
          timer.fire
        end
sleep 0.01   # <===========   force a "context switch"

        timer.reset
        @schedule << timer if timer.repeat
      end

During normal operation, when the context switch does not occur, both Timers run succesfully and all output is seen. During the buggy edge case, we only see the first timer execute, before the mutexs block until the process exits.

movitto avatar Mar 28 '18 20:03 movitto

@movitto great analysis of the problem and also for coming up with a simple demo. Nesting timers wasn't something I needed or designed for. Time for this project is very limited these days so I can't look into this. If you come up with a good solution then please send a PR. My recommendation though is look into if you can avoid nesting the timers.

chadrem avatar Mar 28 '18 20:03 chadrem

@chadrem thx for the quick response. I tried changing the order of operation, eg resetting the timer before firing the callback and the issue seemed to be resolved, even with the 'artificial' forced context switch.

PR sent. Let me know if this won't work for whatever reason and we can try to figure something else out.

movitto avatar Mar 29 '18 12:03 movitto

This issue is most likely solved in PR #10 (leaving a note here in case anyone else runs into it).

chadrem avatar Mar 30 '18 04:03 chadrem

@movitto I started review your code and I am thinking about a potential different solution. I need to dig in deeper, but upon first looking at it, it may be possible to get rid of @mutex in Timer.rb. If so, this would simplify the code since there would only be one mutex in Scheduler.rb and thus a lot less chance of deadlocks. I'm not going to make any changes until I inspect all the code and verify where locks are/aren't needed.

chadrem avatar Mar 31 '18 18:03 chadrem