Persistent job not restarted if resource limit is exceeded
Summary
I'm using Cronicle (which I love BTW :) ) to run multiple persistent jobs on multiple servers using this method: https://github.com/jhuckaby/Cronicle/wiki/Continuously-Running-Jobs.
It works perfectly, except when a job exceeds its memory limit and gets killed by Cronicle. In that case it is not restarted, although both "Run Event on Success" and "Run Event on Failure" are set up (pointing to the same job).
Steps to reproduce the problem
Create an on-demand event and set up a chain reaction to restart the job automatically (both success and failure should point back to the same event). Retries: 0, concurrency: 1.
Set memory limit to 50 MB.
Use the Shell Plugin to run the following script (it will consume a bit more than 100 MB):
#!/bin/bash
A="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
for power in $(seq 20); do
  A="${A}${A}"
done
sleep 100
Run the job. It will be taken down by Cronicle in 10-15 seconds and it won't be restarted.
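As a sanity check on the allocation size (my own arithmetic, not from the issue): the loop doubles a 64-byte seed string 20 times, so the final string alone is 64 × 2^20 bytes ≈ 64 MiB, and bash's copy-on-append during the doublings pushes peak usage well past the 50 MB limit.

```shell
# Final string length after doubling a 64-byte seed 20 times
echo $(( 64 * 2**20 ))   # 67108864 bytes = 64 MiB
```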
Your Setup
Operating system and version?
Ubuntu 20.04 LTS
Node.js version?
10.19.0
Cronicle software version?
0.9.3
Are you using a multi-server setup, or just a single server?
Multi-server, but the issue reproduces on the master server itself as well.
Are you using the filesystem as back-end storage, or S3/Couchbase?
FS
Can you reproduce the crash consistently?
Yes
Log Excerpts
Here is the log of the job which is not restarted.
# Job failed at 2022/05/14 05:20:43 (GMT+1).
# Error: Job Aborted: Exceeded memory limit of 1 GB
# End of log.
First, although I don't think it is relevant to this issue, I have to point out that you're running Node.js 10.19. That version is very old and unsupported. I highly recommend you upgrade to the latest LTS (Node.js v16). I am honestly kind of amazed Cronicle v0.9.3 even runs on Node 10.
So yeah, the resource limit thing is definitely a problem with "continuous" events. The resource limit system aborts the job, rather than failing it. Aborted jobs don't activate the chain reaction system (by design), so the "continuous" event trick doesn't work in this case.
It seems like you shouldn't use Cronicle's monitoring system here, and should instead rely on the OS. The Linux kernel will kill an out-of-control runaway process, and that will fail the job rather than abort it.
Also, Cronicle v2 (Orchestra) has a way to handle this as well. It has a new "Continuous" mode you can select in the timing options, which will always keep the job running, no matter what action caused it to fail / abort / complete / whatever.
So this issue is fixed in v2. I don't see an easy way to do it in v0, however, so I'd recommend you disable the memory monitor for your continuous event (or set it to a very high number like 1 TB) and let the OS kill the process on OOM.
Orchestra will be released this year (2022).
Thank you for your comment, and good luck with releasing Orchestra :)
I cannot leave the RAM consumption of the tasks uncontrolled, because I have multiple tasks in the system and I also need some free RAM for the page cache. So I implemented a workaround like this:
ulimit -v 1000000
# actual shell code here
ulimit -v unlimited
It seems to do the trick: the OS kills the process without Cronicle's help (the Cronicle limit is set to 1 TB).
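For what it's worth, here's a minimal self-contained sketch of the same idea (the 100000 KB cap and the string-doubling memory hog are my own illustrative choices, not from the issue). Putting the cap inside a subshell scopes the limit to the memory-hungry part, which also sidesteps the fact that `ulimit -v N` lowers the hard limit too, so restoring it with `ulimit -v unlimited` afterwards can fail for a non-root user.

```shell
#!/bin/bash
# Sketch: cap virtual memory for one section of a script using a subshell,
# so the limit dies with the subshell and nothing needs to be restored.
(
  ulimit -v 100000                               # cap at ~100 MB (units: KB)
  A=$(head -c 1048576 /dev/zero | tr '\0' 'x')   # build a 1 MB string
  for i in $(seq 9); do A="${A}${A}"; done       # try to double it to ~512 MB
) 2>/dev/null
if [ $? -ne 0 ]; then
  echo "killed by ulimit"
else
  echo "survived"
fi
```

On Linux, the doubling loop should hit the cap partway through; bash fails to allocate, the subshell exits non-zero, and the script prints "killed by ulimit".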
Ah, very clever! I did not know ulimit could govern memory in that way. Nice solution!