
Instance becomes unresponsive after `email-analytics-fetch-latest` uses up all the CPU

Open · anujdeshpande opened this issue 1 year ago • 6 comments

Issue Summary

2 days in a row, my Ghost site became unresponsive. After checking the logs, I see that there's a service that's being started again and again.

Happy to help figure out what might be going on here!

Steps to Reproduce

  1. Create an AWS Lightsail instance
  2. Configure mailgun keys
  3. Check that everything works and leave the site running for 2 days


Ghost Version

5.7.0

Node.js Version

16.16.0

How did you install Ghost?

Instructions for Ubuntu

Database type

MySQL 8

Browser & OS version

Ubuntu 20.04 Latest

Relevant log / error output

Aug 06 12:16:52 ip-172-26-8-210 systemd[1]: Started Ghost systemd service for blog: makerville-io.
-- Reboot --
Aug 06 11:01:56 ip-172-26-8-210 node[895]: 
Aug 06 11:01:56 ip-172-26-8-210 node[895]:     at processTimers (node:internal/timers:502:7)
Aug 06 11:01:56 ip-172-26-8-210 node[895]:     at listOnTimeout (node:internal/timers:559:17)
Aug 06 11:01:56 ip-172-26-8-210 node[895]:     at Timeout.scheduleTimeout [as _onTimeout] (/var/www/sitename/versions/5.7.0/node_modules/@breejs/later/lib/index.js:1038:7)
Aug 06 11:01:56 ip-172-26-8-210 node[895]:     at /var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:546:27
Aug 06 11:01:56 ip-172-26-8-210 node[895]:     at Bree.run (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:324:42)
Aug 06 11:01:56 ip-172-26-8-210 node[895]: Error: Job "email-analytics-fetch-latest" is already running
Aug 06 11:01:56 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 11:01:56 ip-172-26-8-210 node[895]: Job "email-analytics-fetch-latest" is already running
Aug 06 11:01:56 ip-172-26-8-210 node[895]: 
Aug 06 11:01:56 ip-172-26-8-210 node[895]: [2022-08-06 10:45:45] WARN Job "email-analytics-fetch-latest" is already running
Aug 06 10:47:18 ip-172-26-8-210 node[895]: 
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at processTimers (node:internal/timers:502:7)
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at listOnTimeout (node:internal/timers:559:17)
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at Timeout.scheduleTimeout [as _onTimeout] (/var/www/sitename/versions/5.7.0/node_modules/@breejs/later/lib/index.js:1038:7)
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at /var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:546:27
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at Bree.run (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:324:42)
Aug 06 10:47:18 ip-172-26-8-210 node[895]: Error: Job "email-analytics-fetch-latest" is already running
Aug 06 10:47:18 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 10:47:18 ip-172-26-8-210 node[895]: Job "email-analytics-fetch-latest" is already running
Aug 06 10:47:18 ip-172-26-8-210 node[895]: 
Aug 06 10:47:18 ip-172-26-8-210 node[895]: [2022-08-06 10:40:57] WARN Job "email-analytics-fetch-latest" is already running
Aug 06 10:47:18 ip-172-26-8-210 node[895]: 
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at processTimers (node:internal/timers:502:7)
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at listOnTimeout (node:internal/timers:559:17)
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at Timeout.scheduleTimeout [as _onTimeout] (/var/www/sitename/versions/5.7.0/node_modules/@breejs/later/lib/index.js:1038:7)
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at /var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:546:27
Aug 06 10:47:18 ip-172-26-8-210 node[895]:     at Bree.run (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:324:42)
Aug 06 10:47:18 ip-172-26-8-210 node[895]: Error: Job "email-analytics-fetch-latest" is already running
Aug 06 10:47:18 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 10:47:18 ip-172-26-8-210 node[895]: Job "email-analytics-fetch-latest" is already running
Aug 06 10:47:18 ip-172-26-8-210 node[895]: 
Aug 06 10:47:18 ip-172-26-8-210 node[895]: [2022-08-06 10:36:10] WARN Job "email-analytics-fetch-latest" is already running
Aug 06 10:37:07 ip-172-26-8-210 node[895]: 
Aug 06 10:37:07 ip-172-26-8-210 node[895]:     at processTimers (node:internal/timers:502:7)
Aug 06 10:37:07 ip-172-26-8-210 node[895]:     at listOnTimeout (node:internal/timers:559:17)
Aug 06 10:37:07 ip-172-26-8-210 node[895]:     at Timeout.scheduleTimeout [as _onTimeout] (/var/www/sitename/versions/5.7.0/node_modules/@breejs/later/lib/index.js:1038:7)
Aug 06 10:37:07 ip-172-26-8-210 node[895]:     at /var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:546:27
Aug 06 10:37:07 ip-172-26-8-210 node[895]:     at Bree.run (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:324:42)

Code of Conduct

  • [X] I agree to be friendly and polite to people in this repository

anujdeshpande · Aug 06 '22

I pasted the logs from just before I rebooted in the message above; the logs from when this first started happening are below.

Aug 06 05:49:22 ip-172-26-8-210 node[895]: [2022-08-06 05:49:20] WARN Job "email-analytics-fetch-latest" is already running
Aug 06 05:46:24 ip-172-26-8-210 node[895]: [2022-08-06 05:46:21] INFO "GET /" 200 12172ms
Aug 06 05:45:18 ip-172-26-8-210 node[895]: [2022-08-06 05:45:16] INFO "GET /sw.js" 301 62732ms
Aug 06 05:45:14 ip-172-26-8-210 node[895]: [2022-08-06 05:45:12] INFO "GET /wp-admin/css/" 301 2072ms
Aug 06 05:44:42 ip-172-26-8-210 node[895]: 
Aug 06 05:44:42 ip-172-26-8-210 node[895]:     at processTimers (node:internal/timers:502:7)
Aug 06 05:44:42 ip-172-26-8-210 node[895]:     at listOnTimeout (node:internal/timers:559:17)
Aug 06 05:44:42 ip-172-26-8-210 node[895]:     at Timeout.scheduleTimeout [as _onTimeout] (/var/www/sitename/versions/5.7.0/node_modules/@breejs/later/lib/index.js:1038:7)
Aug 06 05:44:42 ip-172-26-8-210 node[895]:     at /var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:546:27
Aug 06 05:44:42 ip-172-26-8-210 node[895]:     at Bree.run (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:324:42)
Aug 06 05:44:42 ip-172-26-8-210 node[895]: Error: Job "email-analytics-fetch-latest" is already running
Aug 06 05:44:42 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 05:44:42 ip-172-26-8-210 node[895]: Job "email-analytics-fetch-latest" is already running
Aug 06 05:44:42 ip-172-26-8-210 node[895]: 
Aug 06 05:44:42 ip-172-26-8-210 node[895]: [2022-08-06 05:44:42] WARN Job "email-analytics-fetch-latest" is already running
Aug 06 05:44:41 ip-172-26-8-210 node[895]: [2022-08-06 05:44:40] INFO "GET /ems-automation/" 200 96227ms
Aug 06 05:43:54 ip-172-26-8-210 node[895]: [2022-08-06 05:43:52] INFO "GET /sw.js" 301 37440ms
Aug 06 05:39:44 ip-172-26-8-210 node[895]: 
Aug 06 05:39:44 ip-172-26-8-210 node[895]:     at processTimers (node:internal/timers:502:7)
Aug 06 05:39:44 ip-172-26-8-210 node[895]:     at listOnTimeout (node:internal/timers:559:17)
Aug 06 05:39:44 ip-172-26-8-210 node[895]:     at Timeout.scheduleTimeout [as _onTimeout] (/var/www/sitename/versions/5.7.0/node_modules/@breejs/later/lib/index.js:1038:7)
Aug 06 05:39:44 ip-172-26-8-210 node[895]:     at /var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:546:27
Aug 06 05:39:44 ip-172-26-8-210 node[895]:     at Bree.run (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:324:42)
Aug 06 05:39:44 ip-172-26-8-210 node[895]: Error: Job "email-analytics-fetch-latest" is already running
Aug 06 05:39:44 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 05:39:44 ip-172-26-8-210 node[895]: Job "email-analytics-fetch-latest" is already running
Aug 06 05:39:44 ip-172-26-8-210 node[895]: 
Aug 06 05:39:44 ip-172-26-8-210 node[895]: [2022-08-06 05:39:41] WARN Job "email-analytics-fetch-latest" is already running
Aug 06 05:37:36 ip-172-26-8-210 node[895]: 
Aug 06 05:37:36 ip-172-26-8-210 node[895]:     at processTimers (node:internal/timers:502:7)
Aug 06 05:37:36 ip-172-26-8-210 node[895]:     at listOnTimeout (node:internal/timers:559:17)
Aug 06 05:37:36 ip-172-26-8-210 node[895]:     at Timeout.scheduleTimeout [as _onTimeout] (/var/www/sitename/versions/5.7.0/node_modules/@breejs/later/lib/index.js:1038:7)
Aug 06 05:37:36 ip-172-26-8-210 node[895]:     at /var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:546:27
Aug 06 05:37:36 ip-172-26-8-210 node[895]:     at Bree.run (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:324:42)
Aug 06 05:37:36 ip-172-26-8-210 node[895]: Error: Job "email-analytics-fetch-latest" is already running
Aug 06 05:37:36 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 05:37:36 ip-172-26-8-210 node[895]: Job "email-analytics-fetch-latest" is already running
Aug 06 05:37:36 ip-172-26-8-210 node[895]: 
Aug 06 05:37:36 ip-172-26-8-210 node[895]: [2022-08-06 05:31:59] WARN Job "email-analytics-fetch-latest" is already running
Aug 06 05:24:19 ip-172-26-8-210 node[895]: [2022-08-06 05:24:19] INFO Worker for job "email-analytics-fetch-latest" online
Aug 06 05:22:08 ip-172-26-8-210 node[895]: [2022-08-06 05:22:08] INFO "GET /ems-automation/" 200 31764ms
Aug 06 05:22:04 ip-172-26-8-210 node[895]: [2022-08-06 05:22:04] INFO "GET /sw.js" 301 3045ms

anujdeshpande · Aug 06 '22

On closer look, some of the logs in the middle have an ECONNREFUSED error.

Aug 06 06:24:17 ip-172-26-8-210 node[895]: connect ECONNREFUSED ::1:3306
Aug 06 06:24:17 ip-172-26-8-210 node[895]: 
Aug 06 06:24:17 ip-172-26-8-210 node[895]: [2022-08-06 06:24:17] ERROR connect ECONNREFUSED ::1:3306
Aug 06 06:24:17 ip-172-26-8-210 node[895]: [2022-08-06 06:24:17] INFO Capturing error for worker during execution of job: email-analytics-fetch-latest
Aug 06 06:24:17 ip-172-26-8-210 node[895]: [2022-08-06 06:24:17] INFO Worker for job "email-analytics-fetch-latest" online
Aug 06 06:19:19 ip-172-26-8-210 node[895]: 
Aug 06 06:19:19 ip-172-26-8-210 node[895]:     at Worker.<computed>.onexit (node:internal/worker:198:20)
Aug 06 06:19:19 ip-172-26-8-210 node[895]:     at Worker.[kOnExit] (node:internal/worker:278:10)
Aug 06 06:19:19 ip-172-26-8-210 node[895]:     at Worker.emit (node:events:527:28)
Aug 06 06:19:19 ip-172-26-8-210 node[895]:     at Worker.<anonymous> (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:419:40)
Aug 06 06:19:19 ip-172-26-8-210 node[895]: Error: Worker for job "email-analytics-fetch-latest" exited with code 1
Aug 06 06:19:19 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 06:19:19 ip-172-26-8-210 node[895]: Worker for job "email-analytics-fetch-latest" exited with code 1
Aug 06 06:19:19 ip-172-26-8-210 node[895]: 
Aug 06 06:19:19 ip-172-26-8-210 node[895]: [2022-08-06 06:19:19] ERROR Worker for job "email-analytics-fetch-latest" exited with code 1
Aug 06 06:19:19 ip-172-26-8-210 node[895]: [2022-08-06 06:19:19] INFO Capturing error for worker during execution of job: email-analytics-fetch-latest
Aug 06 06:19:19 ip-172-26-8-210 node[895]: 
Aug 06 06:19:19 ip-172-26-8-210 node[895]:     at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1187:16)
Aug 06 06:19:19 ip-172-26-8-210 node[895]: Error: connect ECONNREFUSED ::1:3306
Aug 06 06:19:19 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 06:19:19 ip-172-26-8-210 node[895]:     ECONNREFUSED
Aug 06 06:19:19 ip-172-26-8-210 node[895]: Error Code: 
Aug 06 06:19:19 ip-172-26-8-210 node[895]: connect ECONNREFUSED ::1:3306
Aug 06 06:19:19 ip-172-26-8-210 node[895]: 
Aug 06 06:19:19 ip-172-26-8-210 node[895]: [2022-08-06 06:19:19] ERROR connect ECONNREFUSED ::1:3306
Aug 06 06:19:19 ip-172-26-8-210 node[895]: [2022-08-06 06:19:19] INFO Capturing error for worker during execution of job: email-analytics-fetch-latest
Aug 06 06:19:18 ip-172-26-8-210 node[895]: [2022-08-06 06:19:18] INFO Worker for job "email-analytics-fetch-latest" online
Aug 06 06:15:52 ip-172-26-8-210 node[895]: 
Aug 06 06:15:52 ip-172-26-8-210 node[895]:     at Worker.<computed>.onexit (node:internal/worker:198:20)
Aug 06 06:15:52 ip-172-26-8-210 node[895]:     at Worker.[kOnExit] (node:internal/worker:278:10)
Aug 06 06:15:52 ip-172-26-8-210 node[895]:     at Worker.emit (node:events:527:28)
Aug 06 06:15:52 ip-172-26-8-210 node[895]:     at Worker.<anonymous> (/var/www/sitename/versions/5.7.0/node_modules/bree/lib/index.js:419:40)
Aug 06 06:15:52 ip-172-26-8-210 node[895]: Error: Worker for job "email-analytics-fetch-latest" exited with code 1
Aug 06 06:15:52 ip-172-26-8-210 node[895]: ----------------------------------------
Aug 06 06:15:52 ip-172-26-8-210 node[895]: Worker for job "email-analytics-fetch-latest" exited with code 1
Aug 06 06:15:52 ip-172-26-8-210 node[895]: 
Aug 06 06:15:52 ip-172-26-8-210 node[895]: [2022-08-06 06:15:49] ERROR Worker for job "email-analytics-fetch-latest" exited with code 1
Aug 06 06:15:52 ip-172-26-8-210 node[895]: [2022-08-06 06:15:49] INFO Capturing error for worker during execution of job: email-analytics-fetch-latest

anujdeshpande · Aug 06 '22

Also, Mailgun shows everything's working fine at their end.

anujdeshpande · Aug 06 '22

ECONNREFUSED ::1:3306

3306 is usually the port for MySQL. This suggests that perhaps the job is unable to complete properly because it can't update the database?

Obviously, we don't really want it to max out CPU in this case, but in terms of finding the root cause, the DB connection looks like the first place to check.
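
A quick way to sanity-check that theory (a rough sketch, assuming shell access on the instance; the `ghost` database user below is a placeholder for whichever user your install actually uses):

# Which loopback addresses is mysqld actually listening on?
sudo ss -tlnp | grep 3306

# The error shows a connection to ::1 (IPv6 loopback), so try both families
mysql -h ::1 -P 3306 -u ghost -p -e 'SELECT 1;'
mysql -h 127.0.0.1 -P 3306 -u ghost -p -e 'SELECT 1;'

If the IPv6 connection is refused but the IPv4 one works, something is resolving `localhost` to `::1` while mysqld is only bound to `127.0.0.1`, and setting `database.connection.host` to `127.0.0.1` explicitly in your config would be worth trying.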

ErisDS · Aug 08 '22

@ErisDS I am running MySQL on the same Lightsail instance, and other than this there have been no issues with it. Do you think I should check the MySQL logs specifically?

anujdeshpande · Aug 08 '22

I don't know what the root cause is here - this is a weird one. It's something we've seen before but never found a reliable reproduction case for and so haven't been able to solve.

I'm hoping your MySQL logs might turn up some clue 🕵️‍♀️ so if you could dig through them and share anything interesting, that would be great.
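
In case it helps, a few places worth checking (paths assume a stock Ubuntu MySQL package; the time window below is just the one from your Ghost logs above):

# MySQL's own error log
sudo less /var/log/mysql/error.log

# systemd journal for mysqld, narrowed to the window when Ghost logged the errors
journalctl -u mysql.service --since "2022-08-06 05:00" --until "2022-08-06 13:00"

# On a small Lightsail instance the kernel OOM killer is a common reason for mysqld dying
sudo dmesg -T | grep -iE 'killed process|out of memory'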

ErisDS · Aug 08 '22

Nothing interesting so far from the MySQL logs. The same issue has happened once since I last reported it, so it hasn't disappeared, but the logs aren't helpful either.

From /var/log/mysql/

2022-08-15T09:35:48.028008Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.30-0ubuntu0.20.04.2) starting as process 39277
2022-08-15T09:35:49.146735Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-08-15T09:36:14.336756Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-08-15T09:36:30.800538Z 0 [System] [MY-010229] [Server] Starting XA crash recovery...
2022-08-15T09:36:31.481277Z 0 [System] [MY-010232] [Server] XA crash recovery finished.
2022-08-15T09:36:34.673131Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2022-08-15T09:36:34.673208Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2022-08-15T09:36:37.833939Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '127.0.0.1' port: 33060, socket: /var/run/mysqld/mysqlx.sock
2022-08-15T09:36:37.834289Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.30-0ubuntu0.20.04.2'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  (Ubuntu).

From journalctl -

root@ip-172-26-8-210:/var/www/sitename# journalctl -u mysql.service
-- Logs begin at Tue 2022-08-02 16:06:35 UTC, end at Tue 2022-08-16 05:22:33 UTC. --
Aug 02 16:22:21 ip-172-26-8-210 systemd[1]: Starting MySQL Community Server...
Aug 02 16:22:23 ip-172-26-8-210 systemd[1]: Started MySQL Community Server.
Aug 02 17:47:33 ip-172-26-8-210 systemd[1]: Stopping MySQL Community Server...
Aug 02 17:47:37 ip-172-26-8-210 systemd[1]: mysql.service: Succeeded.
Aug 02 17:47:37 ip-172-26-8-210 systemd[1]: Stopped MySQL Community Server.
-- Reboot --
Aug 02 17:48:25 ip-172-26-8-210 systemd[1]: Starting MySQL Community Server...
Aug 02 17:48:30 ip-172-26-8-210 systemd[1]: Started MySQL Community Server.
Aug 05 03:31:10 ip-172-26-8-210 systemd[1]: Stopping MySQL Community Server...
Aug 05 03:31:21 ip-172-26-8-210 systemd[1]: mysql.service: Succeeded.
Aug 05 03:31:21 ip-172-26-8-210 systemd[1]: Stopped MySQL Community Server.
-- Reboot --
Aug 05 03:36:53 ip-172-26-8-210 systemd[1]: Starting MySQL Community Server...
Aug 05 03:37:00 ip-172-26-8-210 systemd[1]: Started MySQL Community Server.
-- Reboot --
Aug 06 12:16:52 ip-172-26-8-210 systemd[1]: Starting MySQL Community Server...
Aug 06 12:16:59 ip-172-26-8-210 systemd[1]: Started MySQL Community Server.
Aug 15 09:36:37 ip-172-26-8-210 systemd[1]: Started MySQL Community Server.
Aug 16 02:41:24 ip-172-26-8-210 systemd[1]: Stopping MySQL Community Server...
Aug 16 02:41:25 ip-172-26-8-210 systemd[1]: mysql.service: Succeeded.
Aug 16 02:41:25 ip-172-26-8-210 systemd[1]: Stopped MySQL Community Server.
-- Reboot --
Aug 16 02:42:19 ip-172-26-8-210 systemd[1]: Starting MySQL Community Server...
Aug 16 02:42:26 ip-172-26-8-210 systemd[1]: Started MySQL Community Server.

anujdeshpande · Aug 16 '22

Hey @anujdeshpande - we've identified an issue with the email analytics job that can cause it to stall indefinitely if there's an issue contacting Mailgun. It's been fixed and will be out in the next release of Ghost.

Your error message does seem to be slightly different to the one I was seeing, but it could be a cascading effect from the bug we've found. I'd make sure MySQL is healthy on your machine, update to the next release of Ghost when it's live, and see how it goes. If you're still having problems, ping back here and let us know 🙂
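
For reference, the usual Ghost-CLI flow for picking that release up once it's out (a minimal sketch, run as your non-root Ghost user from the install directory):

cd /var/www/sitename   # your Ghost install directory
ghost version          # confirm the currently installed version
ghost update           # fetch and activate the latest release
ghost status           # check the site came back up cleanly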

daniellockyer · Aug 24 '22

Hi, I've been on the latest release consistently since August but I continue to experience the same issue. The MySQL instance is healthy and there's a decent amount of CPU available.

Does anyone have other suggestions?

tseboho · Nov 04 '22

@tseboho that's strange. I created this issue originally, and I haven't had this show up again since @daniellockyer said they fixed it; I pulled in the very next release after that for my site. So perhaps the issue I faced was fixed and you are facing something else?

anujdeshpande · Nov 07 '22

@anujdeshpande that might well be the case, but it sure looks very similar 🤔 I'll keep debugging and possibly post my findings here.

tseboho · Nov 07 '22