god icon indicating copy to clipboard operation
god copied to clipboard

Threads preventing stop_all from working completely

Open bricker opened this issue 11 years ago • 15 comments
trafficstars

Using:

  • god 0.13.4
  • ruby 2.1.1
  • ubuntu 12.04

Trying to run god terminate usually fails to terminate all of the watches (currently only 3 watches are active on this server). It's inconsistent how many watches are successfully killed.

When I comment out these Thread lines, terminate works perfectly: https://github.com/mojombo/god/blob/df3bcfa5af6504b6a5858a02d37699c8c821fb2c/lib/god.rb#L491-L494

What could cause this?

bricker avatar Sep 14 '14 06:09 bricker

I don't have any first-hand knowledge if god works with ruby 2.1. Someone else will have to confirm.

eric avatar Sep 14 '14 06:09 eric

Another interesting development: Removing the Thread lines still didn't consistently stop all of the watches. I had to add some logging into the method. With the logging and the Thread usage removed, termination works consistently every time:

module God
  def self.stop_all
    self.watches.sort.each do |name, w|
      # Added this log line:
      applog(self, :info, "Stopping #{name}...")
      # Thread.new do
        w.unmonitor if w.state != :unmonitored
        w.action(:stop) if w.alive?
      # end
    end

    terminate_timeout.times do
      return true unless self.watches.map { |name, w| w.alive? }.any?
      sleep 1
    end

    return false
  end
end

Maybe these two changes are incidental, but I know that they work.

bricker avatar Sep 14 '14 07:09 bricker

It could be that access to some variables need to be synchronized to ensure CPU caches are properly invalidated.

Another possible issue is that the stop duration for your actions are longer than the terminate_timeout.

eric avatar Sep 14 '14 21:09 eric

We are also using god with ruby 2.1, and experience this issue occasionally. Not entirely sure what's causing it, but it looks like the threaded code could be responsible: if 10 seconds elapse before one of the threads gets to the w.action(:stop) if w.alive? line, then the kill signal(s) will never be sent. At the same time, it seems unlikely that it would take 10 seconds to to process stop signals for three watches. Moreover, if that analysis is correct, I don't think ruby 2.1 vs 1.9.3 would make a difference, since there are no relevant changes to threading in ruby 2 that I'm aware of.

No idea why logging would have any impact on this.

One potential workaround would be to call god stop <watch/group_name> before god terminate, which I believe would not be impacted by the 10s timeout problem that I suspect is the culprit. I think this should probably be done anyway, since the default 10s timeout is set regardless of stop_timeouts you may have specified in your god watches, and I don't think there's any reason for god terminate to have to be aware of this value for each watch.

I'm trying to figure out how to write some higher-level tests (verification/system tests) that might confirm this behaviour, and am open to suggestions if anyone has them. Unless I'm missing something obvious, it does not appear that god currently has any tests of this type.

lukeasrodgers avatar Dec 17 '14 15:12 lukeasrodgers

I was able to reproduce the problem with the simple example (http://godrb.com) and with this configuration file:

(1..20).each {|i|
  God.watch do |w|
    w.name = "simple#{i}"
    w.start = "ruby /Users/carsonreinke/Downloads/god-189/simple.rb #{i}"
    w.keepalive
  end
}

I noticed sometimes it would work and sometimes it would not. In the cases it would not, the process would be left alone and no signal actually sent, however, the thread actually completed.

I added some extra logging and the log output is available here. The watch "simple19" in the example failed to stop.

EDIT: Sorry, forgot to mention, using Ruby 2.1.5.

carsonreinke avatar Dec 22 '14 14:12 carsonreinke

I actually monkeypatch this in my deployment of god. Here's my fix:

def self.stop_all
  self.watches.sort.each do |name, w|
    Thread.new do
      w.unmonitor if w.state != :unmonitored
      until w.state == :unmonitored do
        sleep 1
      end
      w.action(:stop) if w.alive?

.... the rest of the method is the same. This fix seems a bit clumsy. There are a few places in the code where these asynchronous events can bite you. I think there should be a watch.synchronize method or some such.

rcresswell avatar Dec 22 '14 16:12 rcresswell

@rcresswell is there a reason why you couldn't run god stop <watchname/group> before running god terminate?

I think part of the issue people are having here may be a legitimate bug, and part of it is expecting god terminate to do things it isn't designed to do, and probably shouldn't do.

lukeasrodgers avatar Dec 22 '14 16:12 lukeasrodgers

The problem appears to be God::Driver and the it's thread never empties God::DriverEventQueue (or something else I have not figured out). It is evident since there is no log entry of the SIGTERM being sent when the failure happens, so basically the signal is just never sent.

Adding a sleep or removing the threads would just be an intermittent hack, it still will not fix the issue.

The only way I can see this correctly working is stop_all ensuring that all the driver queues are empty before proceeding (or using god stop of course).

carsonreinke avatar Dec 22 '14 16:12 carsonreinke

@lukeasrodgers I use god in rather creative ways, dynamically adding and removing watches. That's why I haven't made a pull request -- I think I am not a common use case.

In any case, god terminate is clearly trying to stop all the processes, it does not seem unreasonable to think that it would succeed in doing so!

rcresswell avatar Dec 22 '14 17:12 rcresswell

@rcresswell I suppose there are arguments both ways. The way I see it, expecting god terminate to work in this way runs into difficulty in this scenario:

Supposed I have a god watch with a stop_timeout of 5 minutes, because god is monitoring something that takes a long time to gracefully shut down, and a graceful shutdown is important. In that case, for god terminate to ensure this watch was stopped correctly, it would have to know about that 300s timeout, and honour it. Supposing I had 10 different watches, all with different timeouts, some being added dynamically, then god terminate would then have to track all these timeouts, and figure out which was the greatest, and never timeout before it had given all watches adequate time to stop themselves. In this case, God::TERMINATE_TIMEOUT_DEFAULT becomes meaningless: the termination timeout will always have to be equal to the greatest watch timeout.

This strikes me as overloading god terminate with responsibilities, but you're right that it is already trying to stop all the processes, so it's not unreasonable to expect that it would actually stop them all.

lukeasrodgers avatar Dec 22 '14 17:12 lukeasrodgers

I believe I found the issue, though I am not sure exactly why.

Looking at God#control, it requests a stop before the unmonitor. So, when I updated stop_all to this, I was not able to reproduce the issue:

    self.watches.sort.each do |name, w|
      applog(self, :info, "Stopping #{name}...")
      Thread.new do
        w.action(:stop)
        w.unmonitor if w.state != :unmonitored
      end
    end

carsonreinke avatar Dec 22 '14 20:12 carsonreinke

I was the having the same problem. I started to trace it down to the Driver communication too (we send the message but the Driver never receives it). @carsonreinke 's solution seems to work for me! Ruby 1.9's end of life is less than two weeks -- February 23, 2015 -- it would be nice it god works correctly with a supported version of Ruby after then.

As for running 'stop' before 'terminate' -- when you do a terminate it makes it pretty clear it is trying to stopping all watches, wait for them to end, and then ending the god completely.

evanbattaglia avatar Feb 11 '15 01:02 evanbattaglia

I am currently having the same problem: Terminating inconsistently fails to stop all watches (although I have already increased the timeout). Sending a TERM signal to the still running processes via kill is usually enough but it seems that they have not received a signal before that - so it got lost somewhere.

This is something I am regularly struggling with as I terminate and restart god on every deployment.

I this issue related to #9?

groe avatar Mar 24 '15 10:03 groe

Any news on this?

groe avatar Sep 21 '15 17:09 groe

no sorry

Date: Mon, 21 Sep 2015 10:16:43 -0700 From: [email protected] To: [email protected] Subject: Re: [god] Threads preventing stop_all from working completely (#189)

Any news on this?

— Reply to this email directly or view it on GitHub.

MaDaMSchat avatar Sep 22 '15 09:09 MaDaMSchat