
Cronicle fails after 7 days of work

Open srgoogle23 opened this issue 2 years ago • 10 comments

After about 7 days of operation, the slave server starts generating a series of entries in the activity log: it loses connectivity for a few seconds and then comes back, over and over, which interferes with the jobs running on it.

Summary

Cronicle slave server fails after ~7 days of work

Steps to reproduce the problem

My tests consisted of simply letting it run for seven days, and the problem shows up consistently after that.

Your Setup

I have two EC2 instances on AWS: the slave is a t2.large and the master is a t2.small. The master serves the UI only and does not run any jobs, while the slave runs all of the jobs. I use S3 as the storage backend for both. The failures usually occur at peak times (when I may run 20 to 30 jobs in the same minute). The slave's CPU (t2.large) never went above 30%.

Operating system and version?

Ubuntu 20.04.2 LTS

Node.js version?

v17.7.2

Cronicle software version?

0.9.2

Are you using a multi-server setup, or just a single server?

multi-server setup, master and slave

Are you using the filesystem as back-end storage, or S3/Couchbase?

S3

Can you reproduce the crash consistently?

Only if I let it run for seven days.

Log Excerpts

It doesn't generate any logs for me.

@jhuckaby

srgoogle23 avatar Mar 30 '22 12:03 srgoogle23

I've never heard of this happening before. I run a large Cronicle cluster of many servers on live production for months at a time, with no issues like this.

It sounds like the server may be running out of memory? I can't think of anything else that would cause a random disconnection after 7 days.

jhuckaby avatar Apr 04 '22 23:04 jhuckaby
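One way to check the memory hypothesis before the next failure is to sample the worker process's resident memory over time and see whether it climbs toward the instance limit. The sketch below is not part of Cronicle; it assumes a Linux host (as here, Ubuntu 20.04) and that you supply the Cronicle daemon's PID yourself.

// memwatch.js - sample the resident memory (VmRSS) of a given PID once a minute.
// Not part of Cronicle; the PID is passed in by you, and /proc/<pid>/status is Linux-only.
const fs = require('fs');

const pid = process.argv[2];
if (!pid) {
  console.error('Usage: node memwatch.js <cronicle-pid>');
  process.exit(1);
}

setInterval(() => {
  try {
    const status = fs.readFileSync('/proc/' + pid + '/status', 'utf8');
    const rss = status.split('\n').find((line) => line.startsWith('VmRSS:'));
    console.log(new Date().toISOString(), rss ? rss.trim() : 'VmRSS not found');
  } catch (err) {
    // If the read fails, the process has probably exited or been restarted;
    // the timestamp can then be lined up against the activity-log entries.
    console.log(new Date().toISOString(), 'process not readable:', err.message);
  }
}, 60 * 1000);

Leaving this running for the full week and redirecting its output to a file gives a memory trend that can be correlated with the disconnect timestamps.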

@srgoogle23 It could also be a network issue. Do those machines have static IPs? In any case, you can check the logs/Cronicle.log file to see whether Cronicle was crashing/restarting, or whether the VM restarted itself. It doesn't sound like a Cronicle issue.

mikeTWC1984 avatar Apr 05 '22 02:04 mikeTWC1984
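If logs/Cronicle.log is large, a small filter makes it easier to confirm whether anything was logged at error level around the disconnects. A minimal sketch follows; it is not part of Cronicle, the bracketed column layout (epoch, date, host, PID, component, level, category, message, data) is inferred from the log excerpts later in this thread, and the default path is an assumption.

// grep-errors.js - print error-level lines from a Cronicle-style log file.
// The field layout is inferred from the excerpts in this thread: the sixth
// bracketed field is treated as the log level. The default path is an assumption.
const fs = require('fs');
const readline = require('readline');

const logPath = process.argv[2] || '/opt/cronicle/logs/Cronicle.log';

const rl = readline.createInterface({ input: fs.createReadStream(logPath) });
rl.on('line', (line) => {
  const fields = line.match(/\[([^\]]*)\]/g) || []; // e.g. ["[1650306586.418]", "[2022-04-18 ...]", ...]
  const level = (fields[5] || '').slice(1, -1);     // strip the surrounding brackets
  if (level === 'error') console.log(line);
});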

It's not a VM problem: CloudWatch hasn't issued any downtime alerts, and Cronicle hasn't produced any logs either, so it's as if it never actually disconnects.

srgoogle23 avatar Apr 07 '22 00:04 srgoogle23

@jhuckaby I will check the memory tomorrow morning and report what I find here.

srgoogle23 avatar Apr 07 '22 00:04 srgoogle23

@jhuckaby It's not a memory issue.

srgoogle23 avatar Apr 07 '22 13:04 srgoogle23

[1650306586.418][2022-04-18 18:29:46][crons.zukk.in][441580][Error][error][job][Failed to fetch job log file: http://172.31.60.150:3012/api/app/fetch_delete_job_log?path=%2Fopt%2Fcronicle%2Flogs%2Fjobs%2Fjl251w6er9c.log&auth=38b937b3eeef304e302013184d86ab7b39bb6845d6a8a64e2dd0b49a78de7ffb: Error: Socket Timeout (30000 ms)][]
[1650306593.336][2022-04-18 18:29:53][crons.zukk.in][441580][Error][error][server][Slave connection failed: crons-worker.zukk.in: Error: timeout][]

@jhuckaby Error.log

srgoogle23 avatar Apr 18 '22 18:04 srgoogle23
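The Socket Timeout above means the master gave up after 30 seconds waiting for the worker's web server to return a job log. One way to separate Cronicle from the network is to issue the same style of request from the master by hand during a peak window and time it. In the sketch below, the host, port, and endpoint are taken from the log line; the log filename and auth token are placeholders, and using a nonexistent filename avoids removing a real job log (the endpoint name suggests it deletes the file it serves).

// probe.js - time one HTTP request from the master to the worker's web server,
// mirroring the fetch that timed out above. EXAMPLE.log and AUTH_TOKEN are
// placeholders; the host and port are taken from the log excerpt.
const http = require('http');

const url = 'http://172.31.60.150:3012/api/app/fetch_delete_job_log'
  + '?path=%2Fopt%2Fcronicle%2Flogs%2Fjobs%2FEXAMPLE.log&auth=AUTH_TOKEN';

const started = Date.now();
const req = http.get(url, { timeout: 30000 }, (res) => {
  res.resume(); // drain the body; only latency and status matter here
  res.on('end', () => {
    console.log('status ' + res.statusCode + ' after ' + (Date.now() - started) + ' ms');
  });
});

req.on('timeout', () => {
  console.log('no response after ' + (Date.now() - started) + ' ms, aborting');
  req.destroy();
});
req.on('error', (err) => console.log('request error: ' + err.message));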

@jhuckaby That's the first time it has gone past 7 days.

srgoogle23 avatar Apr 18 '22 18:04 srgoogle23

@jhuckaby I verified on AWS, and CloudWatch didn't send any alert about a server failure.

srgoogle23 avatar Apr 18 '22 18:04 srgoogle23

Log on the slave (WebServer.log):

[1650306571.02][2022-04-18 18:29:31][crons-worker.zukk.in][4502][WebServer][error][socket][Socket closed unexpectedly: c1049018][{"id":"c1049018","proto":"http","port":3012,"time_start":1650306434523,"num_requests":0,"bytes_in":0,"bytes_out":0,"aborted":true,"total_elapsed":136496,"url":"http://xxx.xxx.xxx.xxx:3012/api/app/fetch_delete_job_log?path=%2Fopt%2Fcronicle%2Flogs%2Fjobs%2Fjl251tzn28e.log&auth=90d41b9d5dcaa1709a3dd21706dbf65da29cee5199a4f239da841afd953314c3","ips":["xxx.xxx.xxx.xxx"],"req_id":"r1049034"}]

srgoogle23 avatar Apr 18 '22 18:04 srgoogle23

I disabled both (master and slave), and after starting them again I get this error:

Mon Apr 18 2022 18:58:40 GMT+0000 (Coordinated Universal Time) - crons-worker.zukk.in - PID 1193
RangeError: Maximum call stack size exceeded
    at debug (/opt/cronicle/node_modules/debug/src/common.js:68:15)
    at Socket.sendPacket (/opt/cronicle/node_modules/engine.io/build/socket.js:372:13)
    at Socket.write (/opt/cronicle/node_modules/engine.io/build/socket.js:351:14)
    at Client.writeToEngine (/opt/cronicle/node_modules/socket.io/dist/client.js:171:23)
    at Client._packet (/opt/cronicle/node_modules/socket.io/dist/client.js:160:14)
    at Socket.packet (/opt/cronicle/node_modules/socket.io/dist/socket.js:179:21)
    at Socket.emit (/opt/cronicle/node_modules/socket.io/dist/socket.js:97:14)
    at constructor.masterSocketEmit (/opt/cronicle/lib/engine.js:326:12)
    at constructor.uploadJobLog (/opt/cronicle/lib/job.js:1209:32)
    at /opt/cronicle/lib/engine.js:765:14

srgoogle23 avatar Apr 18 '22 19:04 srgoogle23
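For context on the crash itself: RangeError: Maximum call stack size exceeded is V8 reporting that the call stack overflowed, which in practice almost always comes from very deep or unbounded recursion; the trace above shows it surfacing inside the debug module while the worker was emitting the job log to the master over socket.io. A minimal illustration of the same class of error, unrelated to Cronicle:

// Not Cronicle code - just a tiny reproduction of the same class of error.
// Every call pushes another stack frame; with no base case, V8 eventually
// throws "RangeError: Maximum call stack size exceeded", as in the trace above.
function recurse(depth) {
  return recurse(depth + 1);
}

try {
  recurse(0);
} catch (err) {
  console.log(err.name + ': ' + err.message);
}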