After upgrading to 12.0 the system hangs

Open daviddesmet opened this issue 1 year ago • 15 comments
Describe the issue you are experiencing

I just upgraded to 12.0 and noticed the frontend refused to load. I plugged the mini PC directly into the monitor, rebooted, and got into the Home Assistant CLI. From there I was able to issue some commands and check the frontend (everything loads), but after just a couple of minutes it hangs: it doesn't respond to any keyboard input and the frontend is also unresponsive (doesn't load).

I've been running HA for quite some time on this mini PC, with no issues until I upgraded. CPU was normally at 2-3% use, RAM at 2.5 GB of 32 GB, and storage at 12% use.

There's nothing in home-assistant.log.1 that shows an issue. I wonder if I'm able to roll back or something, since I can get into the terminal and the HA CLI before it hangs.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

12.0

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Upgrade
  2. Wait
  3. Game over

Anything in the Supervisor logs that might be useful for us?

Only a warning about no valid ingress session.

Anything in the Host logs that might be useful for us?

Nothing.

System information

System Information

version core-2024.2.4
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.12.1
os_name Linux
os_version 6.6.16-haos
arch x86_64
timezone America/Mexico_City
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 4983
Installed Version 1.34.0
Stage running
Available Repositories 1410
Downloaded Repositories 28
HACS Data ok
AccuWeather
can_reach_server ok
remaining_requests 44
Home Assistant Cloud
logged_in true
subscription_expiration March 3, 2024 at 18:00
relayer_connected true
relayer_region us-east-1
remote_enabled true
remote_connected true
alexa_enabled true
google_enabled true
remote_server
certificate_status ready
instance_id
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 12.0
update_channel stable
supervisor_version supervisor-2024.02.0
agent_version 1.6.0
docker_version 24.0.7
disk_total 234.0 GB
disk_used 28.7 GB
healthy true
supported true
board generic-x86-64
supervisor_api ok
version_api ok
installed_addons MariaDB (2.6.1), Studio Code Server (5.15.0), File editor (5.8.0), Advanced SSH & Web Terminal (17.1.1), Node-RED (17.0.7), Home Assistant Google Drive Backup (0.112.1), Mosquitto broker (6.4.0), Nginx Proxy Manager (1.0.1), AdGuard Home (5.0.3), Cloudflared (5.1.4), InfluxDB (5.0.0), Grafana (9.1.3), Glances (0.21.0), Zigbee2MQTT (1.35.3-1), Grott stable branch (2.7) (0.1.7), Frigate (0.13.2), Uptime Kuma (0.12.0)
Dashboards
dashboards 4
resources 13
views 30
mode storage
Recorder
oldest_recorder_run February 19, 2024 at 15:56
current_recorder_run February 26, 2024 at 15:34
estimated_db_size 346.70 MiB
database_engine mysql
database_version 10.6.12

Additional information

No response

daviddesmet avatar Feb 26 '24 21:02 daviddesmet

I just experimented with stopping some add-ons and noticed the system no longer hangs when I turn off Frigate. I've been running Frigate for a while with an Edge TPU and the resource usage is very low, so I find it strange that it is now somehow crashing the host. Will dig a bit more...

daviddesmet avatar Feb 26 '24 22:02 daviddesmet

Can you maybe check Host logs when enabling Frigate?

We did update the kernel, and Edge TPU needs a custom driver which got updated as well. But maybe that new version is buggy 🤔

agners avatar Feb 26 '24 22:02 agners

Hmmm, this is interesting...

I started the Frigate add-on and watched the host logs, but they didn't show anything new. However, the Frigate logs showed a lot of errors trying to read the frames from the cameras until the add-on crashed. This time I had Watchdog disabled, so the add-on wasn't started again. After a couple of minutes, I noticed HA had crashed, so I had to do a manual reboot.

I've reproduced this several times with no useful logs showing up for the host. I checked from the Terminal add-on and also from the host itself (monitor and keyboard connected directly).

On the last few tries, I noticed the system was not hung but very slow: each typed character showed up around 15-20 seconds later. Still, no useful logs.

So it seems your assumption about the driver is correct, since it made the OS unresponsive before Frigate stopped, so it's not related to the Frigate process itself.

daviddesmet avatar Feb 26 '24 23:02 daviddesmet

Some additional information:

I use the M.2 Accelerator (A+E key); I swapped the WiFi PCIe card for it.

It only needs Frigate to be started once. I haven't started the add-on since then and the system is so far stable.

daviddesmet avatar Feb 26 '24 23:02 daviddesmet

I have similar symptoms, but I don't have Frigate and don't know how to troubleshoot!

jesson20121020 avatar Feb 27 '24 07:02 jesson20121020

@jesson20121020 this issue is clearly Frigate/Edge TPU accelerator related, please open a new issue for your case along with all information (detailed symptom description as well as the type of system you are using).

agners avatar Feb 27 '24 09:02 agners

@daviddesmet that is a very interesting observation. It sounds as if the new Linux 6.6 kernel, in combination with the Edge TPU and that particular PCIe port, triggers it? 🤔

Is the accelerator still used, or is Frigate maybe not using the accelerator since the port change? 🤔

Also, is Frigate without the accelerator on HAOS 12.0 stable otherwise?

In the misbehaving setting, do you see increased memory or CPU usage?
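
One way to answer the memory question is to sample /proc/meminfo while Frigate starts and append each reading to a file, so the last values are still on disk if the host hangs. The following is only a minimal sketch, assuming Python is available wherever it runs (for example inside an add-on container, which is an assumption); the function names and the log path are hypothetical and not part of Home Assistant.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_percent(info):
    """Share of RAM in use, derived from MemTotal and MemAvailable."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def log_memory(path, interval=5.0, samples=60):
    """Append one timestamped reading per interval to `path`.

    `path` is a hypothetical log file; it should live on storage that
    survives a reboot, otherwise the readings are lost with the hang.
    """
    for _ in range(samples):
        with open("/proc/meminfo") as src:
            info = parse_meminfo(src.read())
        with open(path, "a") as out:
            out.write(f"{time.strftime('%H:%M:%S')} {used_percent(info):.1f}%\n")
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before a possible hang
        time.sleep(interval)
```

Calling log_memory("meminfo.log") just before starting Frigate would record about five minutes of readings with the defaults; the file name here is only an example.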

agners avatar Feb 27 '24 09:02 agners

I am pretty new to HAOS, but experienced the same issue with freezing after upgrading to 12.0 from 11.5. I am running it in a VM on TrueNAS SCALE.

Symptoms:

  • Console access (Spice) is unresponsive to keyboard.
  • Serial Shell is responsive to keyboard, but any command I issue seems to hang.
  • Web UI is loading some pages, but many are not loading properly.
  • I get notifications on the bottom left that integration XYZ is loading, but they eventually change to a more generic one that keeps appearing.
  • I tried installing 12.0 from scratch and then restoring my backup (partial or complete), with the same result after a while.

Hopefully that helps. I have now reverted to a snapshot of my VM to restore things.

Add-ons:

  • Advanced SSH & Web Terminal
  • Cloudflared
  • Studio Code Server (I feel like this is the culprit)

Integrations (other than default):

  • Cync Lights
  • Dreame Vacuum
  • Ecobee
  • HACS
  • Jandy iAqualink
  • Orbit B-hyve
  • Ring
  • Roku
  • Simplisafe
  • SmartThings
  • TP-Link Omada
  • Tuya

gjobin avatar Feb 27 '24 14:02 gjobin

@gjobin it seems you are not using an Edge TPU or the Frigate add-on, so this is unlikely to be related to this issue. Please open a new issue so we can investigate separately.

agners avatar Feb 27 '24 14:02 agners

@agners I got some good and bad news.

The good news is that it doesn't seem to be related to the TPU; the bad news is that I disabled it, used the CPU instead, and experienced the same issue.

In the graph below, you can see a spike in RAM usage when starting Frigate with the TPU enabled. As soon as it made the system unstable, I rebooted, disabled the TPU and started Frigate again, with the same spike in RAM:

image

image

TPU disabled code:

# detectors:
#   coral:
#     type: edgetpu
#     device: pci:0

Frigate version is 0.13.2, it has been running since the update to 12.0.

daviddesmet avatar Feb 27 '24 15:02 daviddesmet

So obviously something in the Frigate add-on is misbehaving. You can try checking the memory usage of the processes running in the container by running docker exec -ti addon_ccab4aaf_frigate top directly on the host; hopefully that will reveal the process that's responsible.
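
If the interactive top view is hard to read while the system is slowing down, a one-shot process listing can be ranked by resident memory instead. The following is a rough sketch rather than a tested procedure: it assumes the container image ships a procps-style ps that accepts -eo pid,rss,comm, and that Python and the Docker CLI are both available where it runs. The helper names are hypothetical; only the container name comes from the command above.

```python
import subprocess


def parse_ps(text):
    """Parse `ps -eo pid,rss,comm` output into (pid, rss_kb, command) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        # The header row and malformed lines fail the digit checks and are skipped.
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def top_by_rss(rows, n=5):
    """Return the n processes with the largest resident set size."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def container_processes(container="addon_ccab4aaf_frigate"):
    """Run ps inside the container once and parse its output.

    Assumption: `ps -eo pid,rss,comm` is supported by the ps binary
    inside the container image.
    """
    result = subprocess.run(
        ["docker", "exec", container, "ps", "-eo", "pid,rss,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps(result.stdout)
```

Printing top_by_rss(container_processes()) a few times while Frigate starts should show which process, if any, keeps growing.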

sairon avatar Feb 29 '24 10:02 sairon

I'll try to gather some evidence on this too. I'm having the same problems, which at first I thought were disk corruption, but after disabling Frigate for a while I found the crashes stopped. The issues only started after updating the OS, though. The Frigate version remained the same.

TomK avatar Mar 05 '24 09:03 TomK

Possibly related to #3206. No crashes after spending the last week with the Frigate add-on disabled. I re-enabled it yesterday and it crashed within a few minutes.

After a bit of tinkering I managed to resolve my system crashes by switching away from the "full access" version of Frigate, effectively reinstating "protected mode" in the add-on.

TomK avatar Mar 13 '24 17:03 TomK

That's interesting. I don't use the Frigate (Full Access) add-on; I've been using the one just called Frigate and have left it disabled since the issue came up. As soon as I re-enable it, it crashes and I have to manually reboot. There's no "protection mode" toggle on the one I have installed.

I've tried with every update of HA to see if it gets fixed, but so far, it behaves the same for me.

daviddesmet avatar Mar 29 '24 19:03 daviddesmet

I believe I am having a similar problem to this.

Running HAOS on an old Intel i7 laptop with the Wi-Fi card replaced with a Google Coral TPU. Running Frigate on it in unprotected mode.

Every so often (sometimes once a day, most often about once a week) I can't access Home Assistant. I can see the CLI but it is frozen. The only way to get back to Home Assistant is to hard reset the laptop. There's nothing in the Home Assistant or Frigate logs (set to debug) that could be causing this.

I can't seem to figure out how to get to the host log after a crash. If someone can point me to some documentation on how to do this, I would be happy to do some digging/monitoring to help get it resolved.

jarkastr avatar May 11 '24 05:05 jarkastr

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 10 '24 05:08 github-actions[bot]