god status says process is "up" when it's continuously restarting

Open coofercat opened this issue 8 years ago • 1 comments

(I'm a devops/sysadmin and god isn't really my area of expertise, so apologies if I'm asking stupid questions here.)

We've just had a situation where a missing dependency means that our Resque scheduler won't start on some machines (it throws a fatal exception on startup). However, god status always says "resque-scheduler: up", so our monitoring hasn't picked this up and we didn't know there was a problem. It seems god attempts to restart the scheduler every 5 seconds. I won't paste the whole stack trace, but the scheduler says this on startup:

rake aborted!
cannot load such file -- tzinfo/indexes/timezones
...
Tasks: TOP => resque:scheduler
(See full trace by running task with --trace)

...which I assume means the process would have returned a non-zero exit code (if that matters).

Our god config for the scheduler looks like this:

God.watch do |w|
  w.dir          = RAILS_ROOT
  w.name         = "resque-scheduler"
  w.stop_signal  = 'QUIT'
  w.env          = {"RAILS_ENV"=>RAILS_ENV}
  w.interval     = 5.seconds
  w.start        = "rake resque:scheduler"
  w.err_log      = "#{RAILS_ROOT}/log/resque-scheduler_error.log"
  w.log          = "#{RAILS_ROOT}/log/resque-scheduler.log"
  w.uid          = DEFAULT_RUNAS_USER
  w.gid          = DEFAULT_RUNAS_GROUP

  w.transition(:up, :restart) do |on|
    on.condition(:memory_usage) do |c|
      c.above = 350.megabytes
      c.times = 2
    end
  end

  w.transition(:init, { true => :up, false => :start }) do |on|
    on.condition(:process_running) do |c|
      c.running = true
    end
  end

  # determine when process has finished starting
  w.transition([:start, :restart], :up) do |on|
    on.condition(:process_running) do |c|
      c.running = true
      c.interval = 5.seconds
    end

    # failsafe
    on.condition(:tries) do |c|
      c.times = 5
      c.transition = :start
      c.interval = 5.seconds
    end
  end

  # start if process is not running
  w.transition(:up, :start) do |on|
    on.condition(:process_running) do |c|
      c.running = false
    end
  end
end

What strategies can we employ to make god status say something other than "up" when this sort of thing happens? Obviously, if the process stops we want to restart it as quickly as possible, but if it's just continuously restarting, we'd like to catch that situation in some way.
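
One candidate seems to be god's :flapping lifecycle condition, which can move a watch to :unmonitored when it enters :start or :restart too often within a short window. The following is a minimal sketch, untested against this setup. The helper name add_flapping_guard is hypothetical, the thresholds are illustrative only, and durations are written in plain seconds so the snippet does not depend on god's 5.minutes style helpers.

```ruby
# Hypothetical helper (not part of god): attaches a flapping guard to a
# watch object `w`, as yielded by God.watch.
def add_flapping_guard(w)
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      # Count transitions into these states as "flaps".
      c.to_state     = [:start, :restart]
      # Illustrative threshold: 5 flaps within 5 minutes.
      c.times        = 5
      c.within       = 5 * 60
      # Stop monitoring instead of restarting forever.
      c.transition   = :unmonitored
      # Try again after 10 minutes (illustrative value).
      c.retry_in     = 10 * 60
      # Give up for good after 5 such retries within 2 hours.
      c.retry_times  = 5
      c.retry_within = 2 * 60 * 60
    end
  end
end
```

Inside the God.watch block above this would be called as add_flapping_guard(w). If the condition behaves as documented, god status should report the watch as unmonitored rather than up while it is backed off, which is something monitoring could alert on.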

Versions in use:

$ god --version
Version 0.13.7
$ ruby --version
ruby 2.1.10p492 (2016-04-01 revision 54464) [x86_64-linux]

coofercat • Nov 20 '17 12:11