god
Race condition makes "god terminate" intermittently fail
Almost every time I run terminate on one of my 8-CPU daemon servers, with a few jobs running and one failing, it doesn't work because the stop action never actually runs. It's pretty messy to debug, but I believe I have found at least one problem, if not the only one: there's a race condition in the event processing.
The God stop_all method does this:
    self.watches.sort.each do |name, w|
      Thread.new do
        w.unmonitor if w.state != :unmonitored
        w.action(:stop) if w.alive?
      end
    end
The problem is that the unmonitor call gets added to the driver events queue to be run asynchronously. If the driver happens to get a turn before the next line, things work. But if not, that is, if stop_all keeps running before the driver wakes up and grabs the move(:up, :unmonitored) event from the queue, then the :stop action gets queued immediately behind it in the driver event queue.
Unfortunately, when the driver runs the Task#move(:up, :unmonitored), it does this:
# cleanup from current state
self.driver.clear_events
This results in the stop event being cleared from the events queue! Accordingly, the unmonitor happens but the stop doesn't, so the terminate method then rolls on, obliviously waiting for the watch to finish even though it's never been stopped, eventually giving up.
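To make the two interleavings concrete, here is a toy model of the queue behaviour described above. It is only a sketch: `ToyDriver`, `schedule`, and `drain` are hypothetical stand-ins and not God's real classes, and the only behaviour borrowed from the description is that running the move event clears whatever is still queued.

```ruby
# Toy model of the driver event queue. ToyDriver and its methods are
# hypothetical stand-ins, not God's real classes.
class ToyDriver
  attr_reader :ran

  def initialize
    @events = [] # pending events, oldest first
    @ran = []    # names of the events that actually executed
  end

  def schedule(name)
    @events << name
  end

  def clear_events
    @events.clear
  end

  # Run queued events in order. A :move event clears whatever is still
  # queued behind it, mirroring the clear_events call in Task#move.
  def drain
    until @events.empty?
      event = @events.shift
      clear_events if event == :move
      @ran << event
    end
  end
end

# Interleaving A: the driver gets a turn between the two calls,
# so the :stop event survives.
lucky = ToyDriver.new
lucky.schedule(:move)
lucky.drain
lucky.schedule(:stop)
lucky.drain

# Interleaving B: both events are queued before the driver wakes up,
# so running the move wipes out the pending :stop.
unlucky = ToyDriver.new
unlucky.schedule(:move)
unlucky.schedule(:stop)
unlucky.drain

p lucky.ran   # [:move, :stop]
p unlucky.ran # [:move]
```

In the second interleaving the stop never runs, which matches the symptom of terminate waiting on a watch that was never stopped.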
I can see a couple of ways to patch this. The most obvious is to move the unmonitor state transition and stop action into one driver event, but that seems like a bit of a hack.
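As a rough illustration of the "one driver event" idea, the sketch below queues the state transition and the stop as a single unit, so nothing sits behind the transition to be cleared. Everything here (`ToyDriver`, the `move` and `stop` lambdas, the log symbols) is hypothetical and only models the queue; it is not a patch against God itself.

```ruby
# Toy model only: ToyDriver, move, and stop are hypothetical names.
class ToyDriver
  attr_reader :log

  def initialize
    @events = [] # queued blocks, oldest first
    @log = []    # record of what actually ran
  end

  def schedule(&block)
    @events << block
  end

  def clear_events
    @events.clear
  end

  # Run queued blocks in order until the queue is empty.
  def drain
    @events.shift.call until @events.empty?
  end
end

driver = ToyDriver.new

# The transition still clears the queue, as in the real code path.
move = -> { driver.clear_events; driver.log << :unmonitored }
stop = -> { driver.log << :stopped }

# Both steps travel as one event, so the clear cannot drop the stop.
driver.schedule { move.call; stop.call }
driver.drain

p driver.log # [:unmonitored, :stopped]
```

The trade-off is that the stop is no longer an independent event, which is probably why it feels like a hack.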
Why does the code clear the events queue? Do we need to unmonitor before queueing the stop action?
Good find! I need to dig into the code again and see why it's written that way. I'll try to get this resolved for 0.10.0 or earlier.
Any progress on this one? Just had another customer project whose terminate is not working, and I think it's probably the same issue.
I'm also running into terminate problems. Would be great to know if this is what's causing them. Would sleeping in between the unmonitor and the stop force a context switch? If so, that could be a quick way to determine whether this is the problem I'm seeing.
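For what it's worth, here is a small threaded sketch of that sleep experiment using only Ruby's standard `Queue`. It is an assumption-laden model, not God's code: the driver thread, the `:move` and `:stop` symbols, and the 0.2 second pause are all made up for illustration. A sleep makes it very likely, but not guaranteed, that the driver thread runs first.

```ruby
# Hypothetical model of the sleep diagnostic; not God's actual driver.
queue = Queue.new
ran = []

# Driver thread: pops events in order. A :move clears anything still queued,
# mirroring the clear_events call described above.
driver = Thread.new do
  while (event = queue.pop) # pop returns nil once the queue is closed and empty
    queue.clear if event == :move
    ran << event
  end
end

queue << :move
sleep 0.2      # give the driver a chance to process :move first
queue << :stop
queue.close    # lets the driver loop finish after the last event
driver.join

p ran
```

If `ran` ends up containing both `:move` and `:stop` with the sleep but only `:move` without it, that points at the same ordering problem.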