resque-sliders
How do I control how long a process watched by KEWatcher can live before it is killed?
What I would like to do is kill a process watched by KEWatcher after it has been alive for an hour.
Looking through the docs at https://github.com/kmullin/resque-sliders/blob/master/README.md, I'm not finding what I'm looking for (but I'm probably misunderstanding the docs).
Could someone please point me in the right direction? (If this functionality doesn't exist, I can implement it and submit a PR.)
Hey!
So currently, there are no internal timings for watched processes, meaning KEWatcher does not know how long a process has been running and therefore cannot kill based on a duration of time.
KEWatcher's main goal is just to keep workers running until you change them in the resque-web UI. If you were to kill a job after an hour, KEWatcher would just poll Redis again for what it should be watching and re-launch the job, unless it had been removed from the list in Redis.
I've noticed, however, that there is a need to track each worker process's system resource usage, such as memory and CPU, and kill based on certain conditions/thresholds there. Though this likely isn't what you're looking for.
Might I ask what sort of use case you're looking at, that would have it kill a job after a certain duration?
Currently, we're seeing cases where resque workers are running for 6+ hours and taking 40% of system RAM.
Normally, we'd expect (as a very simple rule) that no process will take more than an hour to run. This is the itch that I'm trying to scratch.
I've got a PR up for the version of Sliders we are using at: https://github.com/seomoz/resque-sliders/commit/99ff299f079ed092c2b05361e1698d65db309185
If it'd be useful to the community, I'd like to make sure it gets pulled into your branch as well.
Ok, so you're more worried about jobs that should be running but are stuck somewhere in a loop, or waiting for something that'll never finish, and require a kill signal and a re-launch.
We've definitely seen this behavior as well with a few of our jobs, so this isn't a bad idea. However, I would be more inclined to track the usage based on CPU or memory, instead of solely duration. So you can say 'after an hour of >90% cpu|memory usage, kill'. Any thoughts on that?
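To make that concrete, something like this is what I'm picturing (a rough sketch only; nothing here exists in resque-sliders yet, and the names and thresholds are made up for illustration): sample each watched pid's CPU/memory with ps, remember when it first crossed the limit, and only kill once it has stayed over the limit long enough.

```ruby
# Sketch only -- none of this exists in resque-sliders today, and the
# constants/method name below are made up for illustration.
CPU_LIMIT  = 90.0   # percent
MEM_LIMIT  = 90.0   # percent
KILL_AFTER = 3600   # seconds spent above a limit before we kill

def check_resource_usage(pids)
  @over_limit_since ||= {}                  # pid => Time it first exceeded a limit

  pids.each do |pid|
    sample = `ps -o %cpu= -o %mem= -p #{pid}`.split
    next if sample.empty?                   # process already exited
    cpu, mem = sample.map(&:to_f)

    if cpu > CPU_LIMIT || mem > MEM_LIMIT
      @over_limit_since[pid] ||= Time.now
      if Time.now - @over_limit_since[pid] > KILL_AFTER
        begin
          Process.kill('TERM', pid)         # or KILL if it ignores TERM
        rescue Errno::ESRCH
          # already gone
        end
        @over_limit_since.delete(pid)
      end
    else
      @over_limit_since.delete(pid)         # dropped back under the limit
    end
  end
end
```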
I don't see why these two 'features' shouldn't be merged and bundled as one, although I know we do sometimes have a couple of jobs that run for hours, so an hour shouldn't be the default setting.
+1 on the "after an hour of 90%+ cpu/memory usage, kill." This would be super-similar to how god (http://godrb.com/) does its thing.
If you don't want a default of an hour, how about not using a default? i.e. if a user provides a --max-run-time, use it; otherwise pretend that it doesn't exist.
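Something like this is what I'm picturing for the flag (purely illustrative; --max-run-time isn't an existing resque-sliders option): if it's omitted, the setting stays nil and the duration check never runs.

```ruby
# Illustrative only: --max-run-time is not an existing resque-sliders flag.
require 'optparse'

options = {}
OptionParser.new do |opts|
  opts.on('--max-run-time SECONDS', Integer,
          'Kill a watched process after SECONDS (disabled when omitted)') do |secs|
    options[:max_run_time] = secs
  end
end.parse!(ARGV)

@max_run_time = options[:max_run_time]   # nil unless explicitly provided

# later, inside the watch loop:
# kill_long_running_processes! if @max_run_time
```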
Cool, ok. Hopefully if I get some time this week, I can implement the resource usage conditional. It would certainly be helpful for our use case. Hopefully others will find it useful as well.
Would you want both features? i.e. a certain time condition, AND resource conditions based on time? Or would the resource conditions be enough to work for you?
I think that both features would be really useful. The overall problem, IMO, is that sometimes jobs do unexpected things: they use more resources than we wanted them to (which could cause cascading failures, depending on the number of resque workers you have running on a box), or they take longer than we wanted them to. For these cases, it is super-important that KEWatcher can kill these rogue processes fast.
I've changed how we implement this in our branch of resque-sliders: https://github.com/seomoz/resque-sliders/commit/c71ac67c84160597fafd30e939346310ff978a5d
Basically, we just find all the child processes of the Resque Workers, and if they've been running for longer than X, kill them. (The last commit I showed you actually killed the parent Resque workers, not their children. That was not good.)
This could likely be expanded to look out for resource consumption as well (e.g. ps -A -o pid,%cpu,%mem,command | grep... instead of ps -A -o pid,command | grep...).
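In rough Ruby terms, the approach looks something like this (a sketch only, not the exact code from the commit above; the names and one-hour limit are placeholders): list every process with its parent pid and elapsed time, keep only the children of the watched Resque workers, and kill any child that has been alive longer than the limit.

```ruby
# Sketch of the approach, not the exact code from the commit above.
MAX_RUN_TIME = 3600   # seconds; placeholder value

def kill_long_running_children(worker_pids)
  # etimes = elapsed seconds since the process started (procps-ng ps)
  `ps -A -o pid=,ppid=,etimes=,command=`.each_line do |line|
    pid, ppid, elapsed = line.split(' ', 4).first(3).map(&:to_i)

    next unless worker_pids.include?(ppid)   # only children of our workers
    next unless elapsed > MAX_RUN_TIME

    begin
      Process.kill('TERM', pid)              # escalate to KILL if it ignores TERM
    rescue Errno::ESRCH
      # child exited between the ps call and the kill
    end
  end
end
```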
Anyway, food for thought. I know you'd rather have @max_running_time be optional rather than defaulted the way I have it (which is why I'm not making a PR out of this).
This is great. Thanks!
I am going on vacation here in a bit, and won't have time to look over this and merge it in until next week.
@davidann what you have is great, I'd love to incorporate the changes. If you wanted me to merge in that feature, please create a feature branch off my current master branch.
Although it looks like your kill_long_running_processes! is in the main active event loop and isn't throttled, so it's running every interval seconds, which is basically hard-coded to 0.1. This is most likely causing unneeded ps lookups.
You should probably throttle it to run every tick seconds, or at least somewhere between that and a 1-second interval.
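Something along these lines would probably do (a sketch with hypothetical variable names, not the actual KEWatcher loop): keep a timestamp of the last check and skip the ps lookup until tick seconds have passed, even though the loop itself still wakes every 0.1 seconds.

```ruby
# Sketch only -- variable names are hypothetical, not the actual
# KEWatcher internals. The loop still wakes every @interval (0.1s),
# but the expensive ps-based check only runs every @tick seconds.
@interval   = 0.1
@tick       = 10          # or whatever tick value is already in use
@last_check = Time.at(0)

loop do
  if Time.now - @last_check >= @tick
    kill_long_running_processes!     # the ps-based check from the commit
    @last_check = Time.now
  end

  # ... rest of the event loop work ...
  sleep @interval
end
```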