Persistent job not restarted if resource limit is exceeded

Open limhyesook opened this issue 3 years ago • 3 comments

Summary

I'm using Cronicle (which I love BTW :) ) to run multiple persistent jobs on multiple servers using this method: https://github.com/jhuckaby/Cronicle/wiki/Continuously-Running-Jobs.

It works perfectly except when the job exceeds its memory limit and is killed by Cronicle. In that case it is not restarted, even though both "Run Event on Success" and "Run Event on Failure" are set up (pointing to the same job).

Steps to reproduce the problem

Create an on-demand job and set up a chain reaction to restart it automatically (both success and failure should point to the same job). Set retries to 0 and concurrency to 1.

Set memory limit to 50 MB.

Use the Shell Plugin to run the following script (it will allocate a little more than 100 MB):

#!/bin/bash

A="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
for power in $(seq 20); do
  A="${A}${A}"
done

sleep 100

Run the job. It will be taken down by Cronicle in 10-15 seconds and it won't be restarted.
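For reference, the size of the string the script builds can be worked out directly. This is only a sketch of the arithmetic; the helper name `final_string_bytes` is made up for illustration. The final value of `A` is 64 MiB, and peak usage is presumably higher because bash still holds the previous value (and temporaries) while building the doubled one.

```shell
#!/bin/bash

# Hypothetical helper: size in bytes of a seed string after it has been
# doubled a given number of times (seed_len * 2^doublings).
final_string_bytes() {
  local seed_len="$1" doublings="$2"
  echo $(( seed_len * (1 << doublings) ))
}

seed="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"

# The seed is 64 bytes and the loop doubles it 20 times.
bytes=$(final_string_bytes "${#seed}" 20)
echo "final string: ${bytes} bytes ($(( bytes / 1024 / 1024 )) MiB)"
```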

Your Setup

Operating system and version?

Ubuntu 20.04 LTS

Node.js version?

10.19.0

Cronicle software version?

0.9.3

Are you using a multi-server setup, or just a single server?

Multi-server, but the issue is reproduced on the master server itself as well.

Are you using the filesystem as back-end storage, or S3/Couchbase?

FS

Can you reproduce the crash consistently?

Yes

Log Excerpts

Here is the log of the job which is not restarted.

.# Job failed at 2022/05/14 05:20:43 (GMT+1).
.# Error: Job Aborted: Exceeded memory limit of 1 GB
.# End of log.

limhyesook • May 14 '22 18:05

First, I have to point out, although I don't think it is relevant to this issue, you're running Node version 10.19. This is very old, and unsupported. I highly recommend you upgrade to the latest LTS (Node.js v16). I am honestly kind of amazed Cronicle v0.9.3 even runs on Node 10.

So yeah, the resource limit thing is definitely a problem with "continuous" events. The resource limit system aborts the job, rather than failing it. Aborted jobs don't activate the chain reaction system (by design), so the "continuous" event trick doesn't work in this case.

It seems like you shouldn't use Cronicle's monitoring system here, and instead rely on the OS. The Linux kernel should kill an out of control runaway process, and that should fail the job rather than abort it.
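As a rough illustration of why a kernel kill would show up as a failure: a process killed by SIGKILL (the signal the Linux OOM killer sends) is reported to its parent shell with exit status 128 + 9 = 137. The sketch below simulates that with `kill -KILL`; the function name is made up, and it is an assumption here that the Shell Plugin treats any non-zero exit status as a failed job rather than an aborted one.

```shell
#!/bin/bash

# Hypothetical helper: start a child, SIGKILL it (standing in for the
# OOM killer), and print the exit status the parent shell observes.
child_status_after_kill() {
  sleep 30 &
  local pid=$!
  kill -KILL "$pid"
  local status=0
  # wait returns 128 + signal number for a signal-terminated child.
  wait "$pid" || status=$?
  echo "$status"
}

child_status_after_kill
```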

Also, Cronicle v2 (Orchestra) has a way to handle this as well. It has a new "Continuous" mode you can select in the timing options, which will always keep the job running, no matter what action caused it to fail / abort / complete / whatever.

So this issue is fixed in v2. I don't see an easy way to do it in v0, however, so I'd recommend you disable the memory monitor for your continuous event (or just set it to a super-high number like 1 TB), and allow the OS to kill the process for OOM.

Orchestra will be released this year (2022).

jhuckaby • May 15 '22 21:05

Thank you for your comment, and good luck with releasing Orchestra :)

I cannot leave the RAM consumption of the task uncontrolled, because I have multiple tasks in the system and I also need some free RAM for the page cache. So I implemented a workaround like this:

ulimit -v 1000000
#actual shell code here
ulimit -v unlimited

It seems to do the trick: the process is killed without Cronicle's help (the Cronicle limit is set to 1 TB).
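A variation on the same idea, as a minimal sketch: applying the limit inside a subshell scopes it to the wrapped command, so it does not need to be reset afterwards (raising a lowered hard limit back to `unlimited` generally requires root). Bash's `ulimit -v` takes its value in 1024-byte units, so 1000000 is roughly 1 GB of virtual address space. The function name `run_with_mem_limit` and the command `my_long_running_task` are placeholders, not part of Cronicle.

```shell
#!/bin/bash

# Hypothetical helper: run a command with a cap on virtual memory.
# Usage: run_with_mem_limit <limit_in_KiB> <command> [args...]
run_with_mem_limit() {
  local limit_kb="$1"
  shift
  (
    # The limit only applies inside this subshell and its children.
    ulimit -v "$limit_kb" || exit 125
    "$@"
  )
}

# Example (placeholder command): cap the task at roughly 1 GB.
# run_with_mem_limit 1000000 my_long_running_task
```

Allocations beyond the cap fail inside the wrapped command, which would normally make it exit with a non-zero status, while the calling shell keeps its original limits.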

limhyesook • May 16 '22 10:05

Ah, very clever! I did not know ulimit could govern memory in that way. Nice solution!

jhuckaby • May 16 '22 19:05