Race condition makes "god terminate" intermittently fail

Open willbryant opened this issue 14 years ago • 3 comments

Almost every time I run terminate on one of my 8-CPU daemon servers with a few jobs running and one failing, it doesn't work because it never actually runs the stop action. It's pretty messy to debug, but I believe I have found at least one problem, if not the only one: there's a race condition in the event processing.

The God stop_all method does this:

    self.watches.sort.each do |name, w|
      Thread.new do
        w.unmonitor if w.state != :unmonitored
        w.action(:stop) if w.alive?
      end
    end

The problem is that the unmonitor call gets added to the driver's event queue to be run asynchronously. If the driver happens to get a turn before the next line, things work. But if not - if this stop_all method continues running before the driver wakes up and grabs the move(:up, :unmonitored) event from the queue - then the :stop action gets queued immediately behind it in the driver event queue.

Unfortunately, when the driver runs Task#move(:up, :unmonitored), it does this:

    # cleanup from current state
    self.driver.clear_events

This results in the stop event being cleared from the events queue! Accordingly, the unmonitor happens but the stop doesn't, so the terminate method then rolls on, obliviously waiting for the watch to finish even though it's never been stopped, eventually giving up.
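To make the ordering problem concrete, here is a minimal single-threaded sketch of the two interleavings. It is a toy model, not god's code: `ToyDriver`, `message`, and `drain` are hypothetical names, and the only behaviour borrowed from the description above is that handling the unmonitor event clears whatever is still queued.

```ruby
# Toy model of the driver's event queue. Class and method names are
# hypothetical stand-ins, not god's real API.
class ToyDriver
  attr_reader :ran

  def initialize
    @events = []   # pending events, oldest first
    @ran = []      # events that actually executed
  end

  # Queue an event to be run later by the driver.
  def message(name)
    @events << name
  end

  def clear_events
    @events.clear
  end

  # Process queued events until none are left.
  def drain
    until @events.empty?
      event = @events.shift
      @ran << event
      # Mimics the "cleanup from current state" step described above.
      clear_events if event == :unmonitor
    end
  end
end

# Unlucky interleaving: both events are queued before the driver gets a turn.
unlucky = ToyDriver.new
unlucky.message(:unmonitor)
unlucky.message(:stop)
unlucky.drain

# Lucky interleaving: the driver runs in between the two calls.
lucky = ToyDriver.new
lucky.message(:unmonitor)
lucky.drain
lucky.message(:stop)
lucky.drain

p unlucky.ran  # [:unmonitor]
p lucky.ran    # [:unmonitor, :stop]
```

In the unlucky case the stop event is discarded before it ever runs, which matches the symptom of terminate waiting on a watch that was never stopped.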

I can see a couple of ways to patch this. The most obvious is to move the unmonitor state transition and stop action into one driver event, but that seems like a bit of a hack.
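As a rough illustration of the single-event idea, the sketch below uses another toy queue where events are blocks. Everything in it (`ToyDriver`, `schedule`, `drain`, and the `unmonitor`/`stop` stand-ins) is hypothetical and only assumes that the unmonitor step clears the queue; it is not a patch against god itself.

```ruby
# Toy model only. ToyDriver and its methods are hypothetical, not god's API.
class ToyDriver
  attr_reader :log

  def initialize
    @events = []   # pending blocks, oldest first
    @log = []      # steps that actually ran
  end

  # Queue a block to be run later by the driver.
  def schedule(&block)
    @events << block
  end

  def clear_events
    @events.clear
  end

  # Run queued blocks until none are left.
  def drain
    @events.shift.call until @events.empty?
  end

  # Stand-in for the unmonitor transition, including the queue cleanup.
  def unmonitor
    clear_events
    @log << :unmonitor
  end

  # Stand-in for the stop action.
  def stop
    @log << :stop
  end
end

driver = ToyDriver.new

# One event performs both steps, so the cleanup inside unmonitor has no
# separately queued stop event to discard.
driver.schedule do
  driver.unmonitor
  driver.stop
end
driver.drain

p driver.log  # [:unmonitor, :stop]
```

The trade-off is that the stop is no longer an independent event, which is probably why it feels like a hack.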

Why does the code clear the events queue? Do we need to unmonitor before queueing the stop action?

willbryant • Jan 06 '10 03:01

Good find! I need to dig into the code again and see why it's written that way. I'll try to get this resolved for 0.10.0 or earlier.

mojombo • Apr 04 '10 02:04

Any progress on this one? Just had another customer project whose terminate is not working, and I think it's probably the same issue.

willbryant • May 12 '10 22:05

I'm also running into terminate problems. Would be great to know if this is what's causing them. Would sleeping in between the unmonitor and the stop force a context switch? If so, that could be a quick way to determine whether this is the problem I'm seeing.

zaius • Aug 18 '10 01:08