
supervisor doesn't restart node-red

Open b3nj1 opened this issue 1 year ago • 7 comments

Describe the issue you are experiencing

With the add-on watchdog enabled, a crashed add-on (node-red) does not get restarted. The same issue identified in https://github.com/home-assistant/supervisor/issues/3812 is still present with Supervisor 2022.09.1, which includes the fix https://github.com/home-assistant/supervisor/pull/3815 .

What is the used version of the Supervisor?

Supervisor-2022.09.1

What type of installation are you running?

Home Assistant OS

Which operating system are you running on?

Home Assistant Operating System

What is the version of your installed operating system?

9.0

What version of Home Assistant Core is installed?

core-2022.9.4

Steps to reproduce the issue

Crash the add-on.

Anything in the Supervisor logs that might be useful for us?

22-09-19 15:36:44 INFO (MainThread) [supervisor.bootstrap] Initializing Supervisor Sentry
22-09-19 08:36:45 INFO (MainThread) [__main__] Setting up Supervisor
22-09-19 08:36:49 INFO (MainThread) [__main__] Running Supervisor

Additional information

The add-on log has the following tail. Note that the time of the crash is just before the Supervisor INFO message above for "Setting up Supervisor".

19 Sep 08:36:33 - [red] Uncaught Exception:
19 Sep 08:36:33 - [error] UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<Object>".
s6-rc: info: service legacy-services: stopping
s6-svwait: fatal: supervisor died
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped

b3nj1 avatar Sep 19 '22 17:09 b3nj1

Are you saying that Node RED is starting up before supervisor after a reboot? If so, can you clarify how you got into that situation? That's never supposed to happen.

mdegat01 avatar Sep 19 '22 20:09 mdegat01

@mdegat01, no, the add-on starts in the right order. After running for days, the add-on crashed, but it was not restarted by the watchdog mechanism.

b3nj1 avatar Sep 19 '22 20:09 b3nj1

The add-on log has the following tail. Note that the time of the crash is just before the Supervisor INFO message above for "Setting up Supervisor".

What did you mean by this then?

mdegat01 avatar Sep 19 '22 20:09 mdegat01

What did you mean by this then?

It just seemed relevant, but maybe it is a red herring.

I did not initiate anything with the supervisor at that time (that is: I didn't reboot the host, restart HA, or anything like that). It seems interesting that supervisor logged those startup messages and node-red crashed around the same time.

The state prior to that time was: Node-red was running fine. The state after that time: Node-red was stopped. I manually started it about an hour later.

b3nj1 avatar Sep 19 '22 21:09 b3nj1

Can you share more details about what you're doing in Node RED and how you configured the add-on? I just installed Node RED, turned on the watchdog, and ran it. Then I forced it to crash by putting this in a function node and running it:

async function asyncCall() {
  // Throwing inside an async function rejects the promise it returns.
  throw Error("bad!")
}

// The rejected promise is never awaited or handled with .catch(), so
// Node.js raises an UnhandledPromiseRejection and Node-RED crashes.
asyncCall();

Resulting in this log:

20 Sep 12:40:36 - [red] Uncaught Exception:
20 Sep 12:40:36 - [error] Error: bad!
    at asyncCall (Function node:b07f582e2f3b1a5b [function 1]:3:9)
    at Function node:b07f582e2f3b1a5b [function 1]:6:1
    at Function node:b07f582e2f3b1a5b [function 1]:8:3
    at Script.runInContext (node:vm:139:12)
    at processMessage (/opt/node_modules/@node-red/nodes/core/function/10-function.js:401:33)
    at FunctionNode._inputCallback (/opt/node_modules/@node-red/nodes/core/function/10-function.js:335:17)
    at /opt/node_modules/@node-red/runtime/lib/nodes/Node.js:210:26
    at Object.trigger (/opt/node_modules/@node-red/util/lib/hooks.js:166:13)
    at FunctionNode.Node._emitInput (/opt/node_modules/@node-red/runtime/lib/nodes/Node.js:202:11)
    at FunctionNode.Node.emit (/opt/node_modules/@node-red/runtime/lib/nodes/Node.js:186:25)
[12:40:36] WARNING: Node-RED crashed, halting add-on
[12:40:36] INFO: Node-RED stopped, restarting...
s6-rc: info: service legacy-services: stopping
[12:40:36] INFO: Node-RED stopped, restarting...
s6-svwait: fatal: supervisor died
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped

Which results in this in the supervisor log:

22-09-20 12:40:40 DEBUG (MainThread) [supervisor.bus] Fire event 'BusEvent.DOCKER_CONTAINER_STATE_CHANGE' with 'DockerContainerStateEvent(name='addon_a0d7b954_nodered', state=<ContainerState.STOPPED: 'stopped'>, id='2715f50484a5b9f83bfaa0610f0ca047d2176de5d774575cba1912bcad5238ce', time=1663692040)'
22-09-20 12:40:40 DEBUG (MainThread) [supervisor.homeassistant.websocket] Sending: {'type': <WSType.SUPERVISOR_EVENT: 'supervisor/event'>, 'data': {'event': <WSEvent.ADDON: 'addon'>, 'slug': 'a0d7b954_nodered', 'state': <AddonState.STOPPED: 'stopped'>}, 'id': 17}
22-09-20 12:40:40 DEBUG (MainThread) [supervisor.homeassistant.websocket] Received: WSMessage(type=<WSMsgType.TEXT: 1>, data='{"id":17,"type":"result","success":true,"result":null}', extra='')
22-09-20 12:40:40 WARNING (MainThread) [supervisor.addons.addon] Watchdog found addon Node-RED is stopped, restarting...
22-09-20 12:40:40 DEBUG (SyncWorker_2) [supervisor.plugins.dns] Removing host entry 172.30.32.1 - ['a0d7b954-nodered', 'a0d7b954-nodered.local.hass.io']
22-09-20 12:40:40 INFO (SyncWorker_2) [supervisor.docker.interface] Cleaning addon_a0d7b954_nodered application
22-09-20 12:40:40 DEBUG (MainThread) [supervisor.addons.addon] Add-on a0d7b954_nodered write options: {'certfile': 'fullchain.pem', 'credential_secret': 'oy5e87yhkbs54uihbghdytrdkjbk', 'http_node': {'password': '', 'username': ''}, 'http_static': {'password': '', 'username': ''}, 'init_commands': [], 'keyfile': 'privkey.pem', 'npm_packages': [], 'ssl': True, 'system_packages': [], 'theme': 'default'}
22-09-20 12:40:40 DEBUG (MainThread) [supervisor.bus] Fire event 'BusEvent.DOCKER_CONTAINER_STATE_CHANGE' with 'DockerContainerStateEvent(name='addon_a0d7b954_nodered', state=<ContainerState.RUNNING: 'running'>, id='67b35f3f519ef59506f422a22435ce636203aebb0e68b000be74c6f799310bdc', time=1663692040)'
22-09-20 12:40:40 INFO (SyncWorker_7) [supervisor.docker.addon] Starting Docker add-on ghcr.io/hassio-addons/node-red/aarch64 with version 13.4.0

And Node RED is back up and running again.

So what is different on your end?

mdegat01 avatar Sep 20 '22 16:09 mdegat01

Actually, are you seeing repeated failures on startup by any chance? That is, if you look in your supervisor log, do you see the following line a bunch of times in a row due to repeated crashes on startup?

22-09-20 12:40:40 WARNING (MainThread) [supervisor.addons.addon] Watchdog found addon Node-RED is stopped, restarting...

Because there is a limit to how many times the watchdog will restart an add-on within a given time period before assuming something is very wrong and requires manual intervention. The limit is 10 restarts in 30 minutes; once that is hit, the watchdog gives up and assumes something is broken that it can't fix.
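
For illustration, the rate limit described above behaves roughly like the following minimal Python sketch. This is just a model of the description, with made-up names; it is not the Supervisor's actual implementation.

from collections import deque
import time

WATCHDOG_MAX_RETRIES = 10          # restarts allowed...
WATCHDOG_RETRY_WINDOW = 30 * 60    # ...within this many seconds

_restart_times = deque()

def watchdog_may_restart(now=None):
    """Return True if the watchdog is still allowed to restart the add-on."""
    now = time.monotonic() if now is None else now
    # Forget restarts that fell outside the 30-minute window.
    while _restart_times and now - _restart_times[0] > WATCHDOG_RETRY_WINDOW:
        _restart_times.popleft()
    if len(_restart_times) >= WATCHDOG_MAX_RETRIES:
        return False  # over the limit; give up, manual intervention required
    _restart_times.append(now)
    return True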

mdegat01 avatar Sep 20 '22 16:09 mdegat01

Hi @mdegat01, thanks for trying to reproduce this, and for the help!

What is different on your end

A few years' worth of flows :) I have countless flows. I see the add-on crash with the UnhandledPromiseRejection issue once or twice a month. The node-red log file isn't at all helpful in pinpointing what is causing it. Given how infrequently it occurs and how many things my node-red is doing, I don't think a more verbose logging mode would be feasible.

I saw the supervisor issues I referenced above when I looked to see if there were other cases of this symptom. When it was fixed, I figured I was in the clear, since an occasional restart would be acceptable.

if you look in your supervisor log, do you see this line a bunch of times in a row due to repeated crashes on startup

I just rebooted my VM, and everything seems to be running fine, including node-red. Prior to this reboot, there were no warnings about Watchdog found addon... in the log either.

Here is everything that shows in the log after my reboot.

s6-rc: info: service legacy-services: stopping
22-09-20 19:53:58 INFO (MainThread) [__main__] Closing Supervisor
[02:53:58] INFO: Watchdog restart after closing
s6-svwait: fatal: supervisor died
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
[02:53:58] INFO: Supervisor restart after closing
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/udev.sh
[02:54:28] INFO: Using udev information from host
cont-init: info: /etc/cont-init.d/udev.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun supervisor (no readiness notification)
services-up: info: copying legacy longrun watchdog (no readiness notification)
[02:54:28] INFO: Starting local supervisor watchdog...
s6-rc: info: service legacy-services successfully started
22-09-21 02:54:30 INFO (MainThread) [__main__] Initializing Supervisor setup
22-09-21 02:54:30 INFO (MainThread) [supervisor.bootstrap] Initializing Supervisor Sentry
22-09-20 19:54:30 INFO (MainThread) [__main__] Setting up Supervisor
22-09-20 19:54:35 INFO (MainThread) [__main__] Running Supervisor

Are the "Watchdog restart after closing" and "s6-svwait: fatal: supervisor died" messages interesting?

b3nj1 avatar Sep 21 '22 03:09 b3nj1

Is it possible that the order of events was:

  1. supervisor crashed
  2. node-red crashed as a result of 1.
  3. supervisor restarted itself

And all we see is #2 and #3 in the log files?

b3nj1 avatar Sep 27 '22 03:09 b3nj1

If supervisor crashed, it should show up in supervisor's logs. Admittedly, the log viewer in the UI shows a pretty short subset of the logs, so it might have rotated out of view; that's something we're working on. If you can access the Docker CLI, you should be able to run docker logs hassio_supervisor to see much more. Or from the host system you can run journalctl CONTAINER_NAME=hassio_supervisor (you might want to add a --since to that one or you'll get a lot; see man journalctl for more info/options).
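
For example (both --since forms below are standard options for these tools; adjust the time window to taste):

docker logs --since 24h hassio_supervisor
journalctl CONTAINER_NAME=hassio_supervisor --since "1 hour ago"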

I should also note there is an issue around host reboots that I think I fixed in https://github.com/home-assistant/supervisor/pull/3902 , where containers could start before supervisor depending on how the host was rebooted. It seemed to come up particularly often in Proxmox and VMs.

The behavior was a bit unpredictable: if the add-on crashed before supervisor finished its initial setup, the early start was harmless, because supervisor would reach its normal "start the add-ons" phase and start it anyway. But if the add-on didn't crash fast enough (or at all), supervisor would notice it was already running and neither start nor restart it.
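
To make that race concrete, here is a minimal Python sketch of the start-up decision described above (hypothetical names and attributes, not the Supervisor's actual code):

def start_addons_phase(addons):
    """Runs once supervisor finishes its initial setup."""
    for addon in addons:
        if addon.container_state == "running":
            # Already running (e.g. the container raced ahead of supervisor
            # during the host reboot): leave it as-is, neither starting
            # nor restarting it.
            continue
        addon.start()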

So if it's most commonly occurring around host reboots, that could be the issue. But if it's happening while everything is just up and running, then something else is going on.

mdegat01 avatar Sep 27 '22 16:09 mdegat01

I'm using a VM.

I look forward to the #3902 fix because my MQTT add-on often does not function on startup of my VM.

I was able to get the hassio_supervisor logs using the old Portainer add-on, but they start exactly at the event above, not before it. Here are the first 100 lines for the record.


2022-09-19T15:36:43.665416000Z s6-rc: info: service s6rc-oneshot-runner: starting
2022-09-19T15:36:43.670002000Z s6-rc: info: service s6rc-oneshot-runner successfully started
2022-09-19T15:36:43.670103000Z s6-rc: info: service fix-attrs: starting
2022-09-19T15:36:43.673989000Z s6-rc: info: service fix-attrs successfully started
2022-09-19T15:36:43.674198000Z s6-rc: info: service legacy-cont-init: starting
2022-09-19T15:36:43.677050000Z cont-init: info: running /etc/cont-init.d/udev.sh
2022-09-19T15:36:43.697863000Z [15:36:43] INFO: Using udev information from host
2022-09-19T15:36:43.698401000Z cont-init: info: /etc/cont-init.d/udev.sh exited 0
2022-09-19T15:36:43.699872000Z s6-rc: info: service legacy-cont-init successfully started
2022-09-19T15:36:43.700010000Z s6-rc: info: service legacy-services: starting
2022-09-19T15:36:43.705076000Z services-up: info: copying legacy longrun supervisor (no readiness notification)
2022-09-19T15:36:43.707299000Z services-up: info: copying legacy longrun watchdog (no readiness notification)
2022-09-19T15:36:43.730356000Z [15:36:43] INFO: Starting local supervisor watchdog...
2022-09-19T15:36:43.733730000Z s6-rc: info: service legacy-services successfully started
2022-09-19T15:36:44.563691000Z 22-09-19 15:36:44 INFO (MainThread) [__main__] Initializing Supervisor setup
2022-09-19T15:36:44.612586000Z 22-09-19 15:36:44 INFO (MainThread) [supervisor.bootstrap] Initializing Supervisor Sentry
2022-09-19T15:36:45.135369000Z 22-09-19 08:36:45 INFO (MainThread) [__main__] Setting up Supervisor
2022-09-19T15:36:49.850952000Z 22-09-19 08:36:49 INFO (MainThread) [__main__] Running Supervisor
2022-09-20T03:36:38.026309000Z s6-rc: info: service legacy-services: stopping
2022-09-20T03:36:38.181521000Z [03:36:38] INFO: Watchdog restart after closing
2022-09-20T03:36:38.181559000Z s6-svwait: fatal: supervisor died
2022-09-20T03:36:38.181584000Z s6-rc: info: service legacy-services successfully stopped
2022-09-20T03:36:38.181611000Z s6-rc: info: service legacy-cont-init: stopping
2022-09-20T03:36:38.212283000Z s6-rc: info: service legacy-cont-init successfully stopped
2022-09-20T03:36:38.212307000Z s6-rc: info: service fix-attrs: stopping
2022-09-20T03:36:38.212325000Z s6-rc: info: service fix-attrs successfully stopped
2022-09-20T03:36:38.212356000Z s6-rc: info: service s6rc-oneshot-runner: stopping
2022-09-20T03:36:38.220032000Z s6-rc: info: service s6rc-oneshot-runner successfully stopped
2022-09-20T03:38:02.053024000Z s6-rc: info: service s6rc-oneshot-runner: starting
2022-09-20T03:38:02.059334000Z s6-rc: info: service s6rc-oneshot-runner successfully started
2022-09-20T03:38:02.060055000Z s6-rc: info: service fix-attrs: starting
2022-09-20T03:38:02.073279000Z s6-rc: info: service fix-attrs successfully started
2022-09-20T03:38:02.073460000Z s6-rc: info: service legacy-cont-init: starting
2022-09-20T03:38:02.076072000Z cont-init: info: running /etc/cont-init.d/udev.sh
2022-09-20T03:38:02.135841000Z [03:38:02] INFO: Using udev information from host
2022-09-20T03:38:02.136221000Z cont-init: info: /etc/cont-init.d/udev.sh exited 0
2022-09-20T03:38:02.138304000Z s6-rc: info: service legacy-cont-init successfully started
2022-09-20T03:38:02.138499000Z s6-rc: info: service legacy-services: starting
2022-09-20T03:38:02.141677000Z services-up: info: copying legacy longrun supervisor (no readiness notification)
2022-09-20T03:38:02.145008000Z services-up: info: copying legacy longrun watchdog (no readiness notification)
2022-09-20T03:38:02.168437000Z [03:38:02] INFO: Starting local supervisor watchdog...
2022-09-20T03:38:02.173993000Z s6-rc: info: service legacy-services successfully started
2022-09-20T03:38:03.527524000Z 22-09-20 03:38:03 INFO (MainThread) [__main__] Initializing Supervisor setup
2022-09-20T03:38:03.590199000Z 22-09-20 03:38:03 INFO (MainThread) [supervisor.bootstrap] Initializing Supervisor Sentry
2022-09-20T03:38:03.617132000Z 22-09-19 20:38:03 INFO (MainThread) [__main__] Setting up Supervisor
2022-09-20T03:38:11.206719000Z 22-09-19 20:38:11 INFO (MainThread) [__main__] Running Supervisor
2022-09-20T03:38:16.609138000Z 22-09-19 20:38:16 ERROR (MainThread) [supervisor.utils] Can't execute run while a task is in progress
2022-09-20T03:38:17.687756000Z 22-09-19 20:38:17 ERROR (MainThread) [supervisor.api.middleware.security] Invalid token for access /addons/self/info
2022-09-20T03:38:18.410842000Z 22-09-19 20:38:18 ERROR (MainThread) [supervisor.api.middleware.security] Invalid token for access /info
2022-09-20T03:38:19.473680000Z 22-09-19 20:38:19 ERROR (MainThread) [supervisor.api.middleware.security] Invalid token for access /addons/self/options/config
2022-09-20T03:38:20.042167000Z 22-09-19 20:38:20 ERROR (MainThread) [supervisor.api.middleware.security] Invalid token for access /discovery
2022-09-20T03:38:20.211252000Z 22-09-19 20:38:20 ERROR (MainThread) [supervisor.api.middleware.security] Invalid token for access /addons/self/info
2022-09-20T03:43:48.989262000Z s6-rc: info: service legacy-services: stopping
2022-09-20T03:43:49.026202000Z 22-09-19 20:43:49 INFO (MainThread) [__main__] Closing Supervisor
2022-09-20T03:43:49.092146000Z [03:43:49] INFO: Watchdog restart after closing
2022-09-20T03:43:49.097349000Z s6-svwait: fatal: supervisor died
2022-09-20T03:43:49.099261000Z s6-rc: info: service legacy-services successfully stopped
2022-09-20T03:43:49.099412000Z s6-rc: info: service legacy-cont-init: stopping
2022-09-20T03:43:49.131762000Z s6-rc: info: service legacy-cont-init successfully stopped
2022-09-20T03:43:49.131888000Z s6-rc: info: service fix-attrs: stopping
2022-09-20T03:43:49.147922000Z s6-rc: info: service fix-attrs successfully stopped
2022-09-20T03:43:49.148737000Z s6-rc: info: service s6rc-oneshot-runner: stopping
2022-09-20T03:43:49.152494000Z 22-09-19 20:43:49 ERROR (MainThread) [asyncio] Task was destroyed but it is pending!
2022-09-20T03:43:49.152996000Z task: <Task pending name='Task-4193' coro=<process_lock.<locals>.wrap_api() running at /usr/src/supervisor/supervisor/utils/__init__.py:33> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2022-09-20T03:43:49.174784000Z s6-rc: info: service s6rc-oneshot-runner successfully stopped
2022-09-20T03:43:49.206124000Z [03:43:49] INFO: Supervisor restart after closing
2022-09-20T03:44:21.578737000Z s6-rc: info: service s6rc-oneshot-runner: starting
2022-09-20T03:44:21.584621000Z s6-rc: info: service s6rc-oneshot-runner successfully started
2022-09-20T03:44:21.585328000Z s6-rc: info: service fix-attrs: starting
2022-09-20T03:44:21.601265000Z s6-rc: info: service fix-attrs successfully started
2022-09-20T03:44:21.601596000Z s6-rc: info: service legacy-cont-init: starting
2022-09-20T03:44:21.606623000Z cont-init: info: running /etc/cont-init.d/udev.sh
2022-09-20T03:44:21.669363000Z [03:44:21] INFO: Using udev information from host
2022-09-20T03:44:21.669796000Z cont-init: info: /etc/cont-init.d/udev.sh exited 0
2022-09-20T03:44:21.670438000Z s6-rc: info: service legacy-cont-init successfully started
2022-09-20T03:44:21.670881000Z s6-rc: info: service legacy-services: starting
2022-09-20T03:44:21.676314000Z services-up: info: copying legacy longrun supervisor (no readiness notification)
2022-09-20T03:44:21.679127000Z services-up: info: copying legacy longrun watchdog (no readiness notification)
2022-09-20T03:44:21.693252000Z s6-rc: info: service legacy-services successfully started
2022-09-20T03:44:21.718622000Z [03:44:21] INFO: Starting local supervisor watchdog...
2022-09-20T03:44:23.327317000Z 22-09-20 03:44:23 INFO (MainThread) [__main__] Initializing Supervisor setup
2022-09-20T03:44:23.419433000Z 22-09-20 03:44:23 INFO (MainThread) [supervisor.bootstrap] Initializing Supervisor Sentry
2022-09-20T03:44:23.456545000Z 22-09-19 20:44:23 INFO (MainThread) [__main__] Setting up Supervisor
2022-09-20T03:44:34.253269000Z 22-09-19 20:44:34 INFO (MainThread) [__main__] Running Supervisor
2022-09-20T03:50:09.182173000Z 22-09-19 20:50:09 ERROR (MainThread) [supervisor.services.modules.mqtt] There is already a MQTT service in use from core_mosquitto
2022-09-20T04:23:34.569299000Z s6-rc: info: service legacy-services: stopping
2022-09-20T04:23:34.597851000Z [04:23:34] INFO: Watchdog restart after closing
2022-09-20T04:23:34.598518000Z s6-svwait: fatal: supervisor died
2022-09-20T04:23:34.602744000Z s6-rc: info: service legacy-services successfully stopped
2022-09-20T04:23:34.603364000Z s6-rc: info: service legacy-cont-init: stopping
2022-09-20T04:23:34.607505000Z s6-rc: info: service legacy-cont-init successfully stopped
2022-09-20T04:23:34.610242000Z s6-rc: info: service fix-attrs: stopping
2022-09-20T04:23:34.612519000Z s6-rc: info: service fix-attrs successfully stopped
2022-09-20T04:23:34.613872000Z s6-rc: info: service s6rc-oneshot-runner: stopping
2022-09-20T04:23:34.622725000Z s6-rc: info: service s6rc-oneshot-runner successfully stopped
2022-09-20T04:24:53.221973000Z s6-rc: info: service s6rc-oneshot-runner: starting
2022-09-20T04:24:53.227387000Z s6-rc: info: service s6rc-oneshot-runner successfully started
2022-09-20T04:24:53.228089000Z s6-rc: info: service fix-attrs: starting
2022-09-20T04:24:53.236039000Z s6-rc: info: service fix-attrs successfully started

b3nj1 avatar Sep 28 '22 04:09 b3nj1

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 28 '22 05:10 github-actions[bot]