Method for restarting all managed tasks
Hi!
I'm using Angel to manage ~150 workers across a dozen servers for my project Dreamwidth (www.dreamwidth.org). We used to use our own custom software, but decided to switch to Angel recently and are quite happy with it.
One thing that our old system had that Angel does not is the ability to easily say "I need all of the workers to die and restart." This is done in our case because sometimes a core library has changed and we need everybody to pick it up globally at roughly the same time (think deployments).
Related, this should also enable clean shutdowns of Angel. Sometimes I want to deprecate a machine or take it out for maintenance, so it'd be nice to kill the processes and Angel cleanly.
I'm happy to provide design feedback or suggestions if that'd be helpful.
Hi!
Glad you're getting use out of Angel! I'm going to do the library maintainer thing now and put on my "can we do this easily" hat. In a recent release of Angel I added a pidfile option and a count option. If you specify a count of 2 and a pidfile of worker.pid, Angel should expand that into worker-1.pid, worker-2.pid, etc. I'm updating the docs now to reflect this.
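To make the naming scheme concrete, here is a minimal Python sketch of the expansion described above. The helper name `expand_pidfile` is hypothetical and this is not Angel's actual code; it only reproduces the worker.pid to worker-1.pid, worker-2.pid pattern.

```python
from pathlib import PurePosixPath


def expand_pidfile(pidfile: str, count: int) -> list[str]:
    """Return one numbered pidfile path per worker instance (hypothetical helper)."""
    path = PurePosixPath(pidfile)
    # Insert "-N" between the stem and the extension: worker.pid -> worker-1.pid
    return [
        str(path.with_name(f"{path.stem}-{n}{path.suffix}"))
        for n in range(1, count + 1)
    ]
```

For example, `expand_pidfile("worker.pid", 2)` returns `["worker-1.pid", "worker-2.pid"]`.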
So given this, would it work, when you deploy new code, to do something like 'cat /your/app/pids/yourapp-*.pid | xargs kill'? That should send a SIGTERM (kill's default signal) to your apps; they can cleanly shut down and then Angel will restart them.
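If you would rather drive this from a script than from a one-liner, something along these lines should do the same job. It is only a sketch: the function name `signal_workers` is made up for illustration, and it defaults to SIGTERM because that is what kill sends when no signal is given.

```python
import glob
import os
import signal


def signal_workers(pattern: str, sig: int = signal.SIGTERM) -> list[int]:
    """Send `sig` to every PID found in the pidfiles matching `pattern`."""
    signalled = []
    for pidfile in sorted(glob.glob(pattern)):
        try:
            with open(pidfile) as handle:
                pid = int(handle.read().strip())
            os.kill(pid, sig)
        except (OSError, ValueError):
            # Unreadable or malformed pidfile, or the process is already gone.
            continue
        signalled.append(pid)
    return signalled
```

Calling `signal_workers("/your/app/pids/yourapp-*.pid")` returns the list of PIDs that were actually signalled, which is handy for logging during a deploy.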
At work, we have a directory structure with the last 5 or so deploys and then a current symlink. We deploy a new directory in here, change the symlink, and kill the processes, which come back up configured to look at the current dir.
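A rough Python sketch of the symlink step, for illustration. The function name and the temporary-link-plus-rename approach are assumptions on my part; the description above only says the symlink is changed. Renaming a fresh link over the old one is one common way to make the switch atomic on POSIX systems.

```python
import os


def point_current_at(release_dir: str, current_link: str) -> None:
    """Repoint the `current_link` symlink at `release_dir` in one atomic step."""
    tmp_link = f"{current_link}.tmp-{os.getpid()}"
    os.symlink(release_dir, tmp_link)
    # Renaming over the old link replaces it in a single step, so a process
    # starting up never finds the link missing.
    os.replace(tmp_link, current_link)
```

After this runs, killing the workers lets them come back up against the new directory.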
With regards to safe shutdowns, the Angel code has a handler for SIGINT which empties out its config state, which triggers its process supervisor to shut down all processes it is monitoring. That uses terminateProcess which the documentation says sends a SIGTERM. Currently, however, it doesn't verify the process has gone away or bash on it some more.
Thoughts?
Hrmmm. Using the pidfiles as input to my own management for killing/restarting/etc. is fine. I can do that, so let's consider that part settled. We also use the symlink trick for deploys.
The second half re: shutting down Angel, I think that's OK as long as Angel doesn't delete the pidfile and I can still make sure that they go away. Failing that, I would recommend some sort of SIGTERM (wait 10 seconds) SIGKILL process, although the downside there is that sometimes you have processes that you never want to SIGKILL (databases, notably) and so you'd want those to wait. In that case, it may not make sense for you to implement that behavior -- the downside is too large. (And I don't suggest making it an option. I dislike options.)
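For reference, the SIGTERM (wait) SIGKILL idea looks roughly like this from the point of view of a supervisor that owns its child processes. This is a Python sketch, not a proposal for Angel's implementation; `stop_child` and its 10-second default are illustrative names and values.

```python
import subprocess


def stop_child(proc: subprocess.Popen, grace_seconds: float = 10.0) -> bool:
    """Ask `proc` to exit with SIGTERM and escalate to SIGKILL after the grace period.

    Returns True if the child exited on its own, False if it had to be killed.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the child so no zombie is left behind
        return False
```

The database concern shows up directly here: any process that needs longer than `grace_seconds` to flush its state gets killed mid-shutdown.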
Anyway, with the PID files and SIGINT I think that I can accomplish everything I need. Thanks so much for the suggestions. I'll leave this open in case, but I suspect you can close this.
Hate to resurrect this from the dead but I'm trying to poll users of Angel about the correct behavior here to see if I can actually close it. You raise good points on the tradeoff between configuration and required behavior. I think the only way to decide between leaving the behavior how it is or adding new behavior is whether or not there's an agreeable default that requires no configuration.
The safest default would be to only send a SIGINT and go no further. I like to be conservative if the user has not told me to do something explicitly and what I'm going to do may burn them. The user could then opt in by specifying a config option for how long to wait before a forceful SIGKILL per process, e.g. kill_delay = 10.
tl;dr there are 2 possibilities:
- Bringing angel down sends a SIGINT to all tethered processes and goes no further (current).
- Bringing angel down sends a SIGINT and then if configured for the process, a SIGKILL after a delay. No explicit choice means no SIGKILL.
Sorry for the slow response!
I think #2 sounds fine. I'd like to be able to have this kind of thing, so I can ensure software dies when Angel does. It makes my restart logic a lot saner. :)
Mark Smith
Since this isn't closed, I assume you haven't made a decision yet.
I've recently started using Angel, and love it already. My personal vote would be on the second option described. This makes no breaking changes to the current behaviour, and adds a nice option to forcefully kill scripts that may not want to die for various reasons.
In short, a kill_delay option sounds like a very good approach.
Just to throw in my opinion on the matter since it still seems unresolved.
I actually did start work on this a while ago but could not get the tests working. It is on this branch:
https://github.com/MichaelXavier/Angel/tree/sigkill
I may try picking this back up again, but if you wanted to take a look at it, that would be appreciated. The tests run a binary that yields to SIGTERMs and one that doesn't, and I couldn't get consistent behavior out of it.
I took some time this evening and figured out the issues and have a passing test suite in development. I have to do some cleanup after. Would either one of you be willing to test out the solution prior to release and make sure it fits your needs?
Alright guys, I think the feature is ready for some real world testing. Take a look at the sigkill branch here:
https://github.com/MichaelXavier/Angel/tree/sigkill
To achieve what you guys are after, you should just have to add a termgrace = 10 line to each program's config, where 10 is the number of seconds to wait between sending a gentle shutdown with SIGTERM and a hard shutdown with SIGKILL. Setting the value to "off" or 0, or omitting it, disables this functionality, and only a SIGTERM will be sent.
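Spelled out as a small Python sketch, the rules for the value read like this. The function `parse_termgrace` is hypothetical and is not how Angel parses its config; it only encodes the three "disabled" cases listed above, and rejecting other values is my own assumption.

```python
from typing import Optional, Union


def parse_termgrace(value: Union[str, int, None]) -> Optional[int]:
    """Return the grace period in seconds, or None when no SIGKILL should follow."""
    if value is None or value == "off":
        return None  # omitted or explicitly off: only SIGTERM is sent
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        # Assumption: anything else is treated as a configuration error.
        raise ValueError(f"invalid termgrace value: {value!r}")
    return None if value == 0 else value  # 0 also disables the SIGKILL step
```

So `parse_termgrace(10)` gives a 10-second grace period, while `parse_termgrace("off")`, `parse_termgrace(0)` and `parse_termgrace(None)` all mean "SIGTERM only".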
Any luck with this @Tehnix and @xb95 ?