
Feature idea: signal failure if command completes too quickly

Open cuu508 opened this issue 4 years ago • 4 comments

Sometimes apps and scripts fail early, but still return exit code 0. And legacy systems can be hard to fix.

Perhaps there could be an optional feature where runitor measures the execution time of the command, and signals failure if the command completes too quickly? Something like:

# signals failure (the command finishes in under 5s)
runitor -min-time=5s -uuid 6e1fbf8f-c17e-4749-af44-0c81461bdd19 -- sleep 1

# signals success (the command runs for at least 5s)
runitor -min-time=5s -uuid 6e1fbf8f-c17e-4749-af44-0c81461bdd19 -- sleep 6
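
For illustration, here is a minimal shell sketch of the proposed check, independent of runitor itself. `-min-time` above is only a proposed flag, and the function name `run_with_min_time` is made up for this example. It assumes `date +%s` is available, which gives whole-second resolution, so it only makes sense for thresholds of a few seconds or more.

```shell
# Hypothetical helper: run a command and treat "finished too quickly" as failure.
# Usage: run_with_min_time MIN_SECONDS COMMAND [ARGS...]
run_with_min_time() {
    min_seconds=$1
    shift

    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    elapsed=$((end - start))

    # A genuine non-zero exit code is passed through unchanged.
    if [ "$status" -ne 0 ]; then
        return "$status"
    fi

    # Exit code 0, but the run was shorter than the minimum: report failure.
    if [ "$elapsed" -lt "$min_seconds" ]; then
        echo "finished in ${elapsed}s, expected at least ${min_seconds}s" >&2
        return 1
    fi

    return 0
}
```

With this, `run_with_min_time 5 sleep 1` returns non-zero, while `run_with_min_time 5 sleep 6` returns 0.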

cuu508, May 13 '20 10:05

PS. This could also be done on the server side; there's a ticket for that: https://github.com/healthchecks/healthchecks/issues/236

One benefit of doing it on client side is more precise timing. HTTP requests have latency – if you measure second-long events using HTTP requests that can also sometimes take seconds, you'll sometimes get false positives and false negatives.
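
A rough worked example of that effect, assuming the server timestamps each ping when it arrives. The function name and the numbers are made up for illustration; each delay is the time the corresponding ping takes to reach the server.

```shell
# Duration as seen by the server when it timestamps pings on arrival:
#   measured = true_duration - start_ping_delay + end_ping_delay
# Usage: server_measured_ms TRUE_DURATION_MS START_PING_DELAY_MS END_PING_DELAY_MS
server_measured_ms() {
    echo $(( $1 - $2 + $3 ))
}

# With a 5000 ms minimum run time:
# server_measured_ms 4000 100 1500   prints 5400 (a too-quick 4 s run looks fine)
# server_measured_ms 5500 1500 100   prints 4100 (a healthy 5.5 s run looks too quick)
```

Measuring on the client avoids both delay terms, because start and end are read from the same local clock.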

cuu508, May 13 '20 10:05

Huh. Initially I thought about a possible max-runtime feature to catch runaway processes but that's not very Unix when we already have the right tool for it, timeout from GNU coreutils.

This one is the other way around. I'm curious, how much of a common problem is this?

bdd, May 13 '20 20:05

Here's a one-minute implementation: https://github.com/bdd/runitor/commit/e0af841e9935d8b213a11a7290eec58c52a9b191

Honestly it doesn't feel like this feature fits in along with the others. Other than stdout & stderr routing, runitor just implements healthchecks.io Ping API features. Nothing less, nothing more.

Would you consider client-measured run duration to be passed as a parameter for success pings (and for symmetry also failure)? This way users can define an "inverse of grace period" as mentioned in healthchecks/healthchecks#236, and the act of lifting the signal to failure is done on the server side.

bdd, May 14 '20 05:05

> I'm curious, how much of a common problem is this?

It's not a common request. It has been requested a few times (in #236 and in email). Measuring run time on client was never explicitly suggested, that was my idea.

Also, if you have a legacy environment where you cannot fix the exit code, chances are you also cannot add a wrapper (runitor) around your command.

I think this feature would fit in if you're OK with a Swiss Army knife kind of tool, one that collects various niche but sometimes useful features (examples: uwsgi, curl, ImageMagick, caddy). If you'd rather keep it small and focused, then it's not a good fit.

> Would you consider client-measured run duration to be passed as a parameter for success pings (and for symmetry also failure)?

The API currently doesn't support that, but it could be added. A reported execution time would take priority over the execution time measured on the server. Dead Man's Snitch's API and Field Agent work like that (no start signal, execution time measured on the client and reported after the job completes).
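
As a sketch of what a client-reported duration could look like on the wire: the `duration` query parameter below is purely hypothetical (as noted, the API does not accept it today), and `build_ping_url` is a made-up helper name. The only thing taken from the existing Ping API is the `https://hc-ping.com/<uuid>` URL with a `/fail` suffix for failures.

```shell
# Build a ping URL that carries a client-measured run duration.
# Usage: build_ping_url UUID EXIT_STATUS DURATION_SECONDS
build_ping_url() {
    url="https://hc-ping.com/$1"

    # A non-zero exit status maps to the failure endpoint.
    if [ "$2" -ne 0 ]; then
        url="$url/fail"
    fi

    # HYPOTHETICAL parameter: "duration" is an assumption made for this
    # example and is not part of the current Ping API.
    echo "$url?duration=$3"
}
```

The server could then compare the reported duration against a per-check minimum and decide on its own whether to lift a success ping to a failure.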

cuu508, May 14 '20 07:05

> Honestly it doesn't feel like this feature fits in along with the others.

Yep, and there's always the option to wrap the legacy app in a custom shell script, with custom success/failure testing criteria (how much time it took? what output did it produce? did it produce file X on the filesystem? etc.).
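
A minimal sketch of such a wrapper, covering the output and file checks. The function name `run_and_verify` and its argument order are invented for this example; a real wrapper would use whatever criteria fit the legacy job.

```shell
# Run a command and apply extra success criteria on top of its exit code.
# Usage: run_and_verify EXPECTED_FILE EXPECTED_TEXT COMMAND [ARGS...]
run_and_verify() {
    expected_file=$1
    expected_text=$2
    shift 2

    # Capture stdout and stderr so the output can be inspected, then replay it.
    output=$("$@" 2>&1)
    status=$?
    printf '%s\n' "$output"

    # Criterion 1: the command's own exit code.
    [ "$status" -eq 0 ] || return "$status"

    # Criterion 2: the output contains the expected text (substring match).
    case $output in
        *"$expected_text"*) ;;
        *) return 1 ;;
    esac

    # Criterion 3: the expected file exists and is not empty.
    [ -s "$expected_file" ] || return 1

    return 0
}
```

Because the wrapper turns these checks into an ordinary exit code, it can be placed after the `--` in a runitor invocation in place of the raw command.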

cuu508, Oct 27 '22 11:10