
How do I control how long a process watched by KEWatcher can live before it is killed?

Open weiser opened this issue 11 years ago • 9 comments

What I would like to do is kill a process watched by KEWatcher after it has been alive for an hour.

Looking through the docs at https://github.com/kmullin/resque-sliders/blob/master/README.md, I'm not finding what I'm looking for (but I'm probably misunderstanding the docs).

Could someone please point me in the right direction? (If this functionality doesn't exist, I can implement it and submit a PR.)

weiser avatar May 06 '13 16:05 weiser

Hey!

So currently there are no internal timers for watched processes: KEWatcher doesn't know how long a process has been running, so it can't kill one based on elapsed time.

KEWatcher's main goal is just to keep workers running until you change them in the resque-web UI. If you were to kill a worker after an hour, KEWatcher would just poll Redis again for what it should be watching and re-launch it, unless it had been removed from the list in Redis.

I've noticed, however, that there is a need to track each worker process's system resource usage, such as memory and CPU, and kill based on certain conditions/thresholds there. Though this likely isn't what you're looking for.

Might I ask what sort of use case you're looking at, where you'd want to kill a job after a certain duration?

kmullin avatar May 06 '13 18:05 kmullin

Currently, we're seeing cases where resque workers are running for 6+ hours and taking 40% of system RAM.

Normally, we'd expect (as a very simple rule) that no process will take more than an hour to run. This is the itch I'm trying to scratch.

I've got a PR up for the version of Sliders we are using at: https://github.com/seomoz/resque-sliders/commit/99ff299f079ed092c2b05361e1698d65db309185

If it'd be useful to the community, I'd like to make sure it gets pulled into your branch as well.

weiser avatar May 06 '13 18:05 weiser

Ok, so you're more worried about jobs that should be running but are stuck somewhere in a loop, or waiting for something that'll never finish, and require a kill signal and a re-launch.

We've definitely seen this behavior as well with a few of our jobs, so this isn't a bad idea. However, I would be more inclined to track the usage based on CPU or memory, instead of solely duration. So you can say 'after an hour of >90% cpu|memory usage, kill'. Any thoughts on that?

I don't see why these two 'features' shouldn't be merged and bundled as one, although I know we do sometimes have a couple of jobs that legitimately run for hours, so an hour shouldn't be the default setting.
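A minimal sketch of the "kill only after sustained high usage" idea. Nothing here is part of resque-sliders: `ResourceWatch`, `cpu_for`, and `over_limit?` are hypothetical names, and the `ps` call assumes a POSIX-ish system. The point is just that a single over-threshold sample shouldn't trigger a kill; the process has to stay over the limit for the whole grace period.

```ruby
# Hypothetical helper: kill a worker only after its CPU usage has
# exceeded a threshold continuously for a grace period.
class ResourceWatch
  def initialize(cpu_limit: 90.0, grace: 3600)
    @cpu_limit  = cpu_limit # percent CPU considered "too high"
    @grace      = grace     # seconds the limit must be exceeded continuously
    @over_since = {}        # pid => time the pid first went over the limit
  end

  # Shells out to ps for the pid's current %CPU; nil if the pid is gone.
  def cpu_for(pid)
    out = `ps -o %cpu= -p #{pid}`.strip
    out.empty? ? nil : out.to_f
  end

  # True when the pid has been over @cpu_limit for at least @grace seconds.
  def over_limit?(pid, now = Time.now)
    cpu = cpu_for(pid)
    if cpu.nil? || cpu <= @cpu_limit
      @over_since.delete(pid) # back under the limit: reset the clock
      return false
    end
    @over_since[pid] ||= now
    (now - @over_since[pid]) >= @grace
  end
end
```

A watcher loop would then call `Process.kill('TERM', pid) if watch.over_limit?(pid)` on each pass; the same shape works for memory by swapping `%cpu` for `%mem`.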

kmullin avatar May 06 '13 21:05 kmullin

+1 on the "after an hour of 90%+ cpu/memory usage, kill." This would be super-similar to how god (http://godrb.com/) does its thing.

If you don't want a default of an hour, how about not using a default at all? I.e. if a user provides a `--max-run-time`, use it; otherwise behave as if the feature doesn't exist.

weiser avatar May 06 '13 23:05 weiser

Cool, ok. Hopefully if I get some time this week, I can implement the resource usage conditional. It would certainly be helpful for our use case. Hopefully others will find it useful as well.

Would you want both features? I.e. a plain time condition, AND resource conditions based on time? Or would the resource conditions be enough to work for you?

kmullin avatar May 06 '13 23:05 kmullin

I think that both features would be really useful. The overall problem, IMO, is that sometimes jobs do unexpected things: they use more resources than we wanted them to (which could cause cascading failures, depending on how many resque workers you have running on a box), or they take longer than we wanted them to. For these cases, it is super-important that KEWatcher can kill these rogue processes fast.

weiser avatar May 07 '13 14:05 weiser

I've changed how we implement this in our branch of resque-sliders: https://github.com/seomoz/resque-sliders/commit/c71ac67c84160597fafd30e939346310ff978a5d

Basically, we just find all the child processes of the Resque Workers, and if they've been running for longer than X, kill them. (The last commit I showed you actually killed the parent Resque workers, not their children. That was not good.)

This could likely be expanded to look out for resource consumption as well (e.g. `ps -A -o pid,%cpu,%mem,command | grep ...` instead of `ps -A -o pid,command | grep ...`).
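The approach above can be sketched as follows. This is an illustration, not the actual commit: it assumes `ps` supports the `etimes` (elapsed seconds) column, that resque's forked job children put "Processing" in their command line (which is how resque labels them), and the function names are made up. Separating the `ps` parsing from the kill decision keeps the matching logic testable.

```ruby
# Parse one line of `ps -A -o pid=,etimes=,command=` output into
# [pid, elapsed_seconds, command].
def parse_ps_line(line)
  pid, secs, cmd = line.strip.split(/\s+/, 3)
  [pid.to_i, secs.to_i, cmd]
end

# Return pids of resque job children ("Processing" in the command line)
# that have been alive longer than max_seconds.
def long_running_resque_children(max_seconds,
                                 ps_output = `ps -A -o pid=,etimes=,command=`)
  ps_output.each_line.filter_map do |line|
    pid, secs, cmd = parse_ps_line(line)
    pid if cmd&.match?(/resque.*Processing/) && secs > max_seconds
  end
end

# The watcher would then do something like:
#   long_running_resque_children(3600).each do |pid|
#     Process.kill('TERM', pid)
#   end
```

Targeting the children rather than the parent workers matters for exactly the reason noted above: killing the parent takes out the supervisor-managed process, while killing the child only aborts the stuck job.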

Anyway, food for thought. I know you'd rather have `@max_running_time` be optional rather than defaulted, as I have it (which is why I'm not making a PR out of this).

weiser avatar May 14 '13 15:05 weiser

This is great. Thanks!

I am going on vacation here in a bit, and won't be able to look this over and merge it in until next week.

kmullin avatar May 14 '13 22:05 kmullin

@davidann what you have is great, I'd love to incorporate the changes. If you wanted me to merge in that feature, please create a feature branch off my current master branch.

It looks like your `kill_long_running_processes!` is in the main event loop and isn't throttled, though, so it's running every `interval` seconds, which is basically hard-coded to 0.1. That's most likely causing unneeded `ps` lookups.

You should probably throttle it to every `tick` seconds, or at least somewhere between that and a 1-second interval.
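The throttling suggestion can be sketched with a small helper (the `Throttle` class is illustrative, not from the resque-sliders codebase): the event loop still wakes every `interval` seconds, but the expensive `ps`-based check runs at most once per period.

```ruby
# Runs an expensive block at most once per `period` seconds,
# no matter how often `call` is invoked by a fast event loop.
class Throttle
  def initialize(period)
    @period = period
    @last   = nil # time of the last actual run
  end

  # Yields only if `period` seconds have elapsed since the last run.
  # Returns true if the block ran, false if it was skipped.
  def call(now = Time.now)
    return false if @last && (now - @last) < @period
    @last = now
    yield
    true
  end
end

# Inside the watcher's loop (names hypothetical):
#   throttle = Throttle.new(tick)          # e.g. tick = 10 seconds
#   loop do
#     throttle.call { kill_long_running_processes! }
#     sleep interval                        # interval = 0.1 in KEWatcher
#   end
```

With `interval` at 0.1 and a 10-second period, this cuts the `ps` lookups from ten per second to one every ten seconds while keeping the loop responsive.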

kmullin avatar May 21 '13 23:05 kmullin