Lockup / Freeze on Optiplex 3060 (i5-8500T) - [i915] media: timed out waiting for forcewake ack request.

Open daernsinstantfortress opened this issue 1 year ago • 9 comments

Describe the issue you are experiencing

I'm experiencing periodic (random, but approximately every 1-2 weeks) hard lockups of HAOS, running bare-metal on a Dell Optiplex 3060 (i5-8500T) uSFF machine. It's been stable for months, but this has started over the last 6 weeks or so. The system is completely frozen, with no display on the console and no response to keyboard input. It requires a power cycle to recover, which it does cleanly.

The last occurrence reported the following in the logs (via journalctl):

Sep 29 15:10:23 homeassistant kernel: [drm:fw_domains_get_with_fallback [i915]] *ERROR* media: timed out waiting for forcewake ack request.
Sep 29 15:10:23 homeassistant kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by fw_domains_get_with_fallback+0x1d3/0x220 [i915]
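
To track how often this error recurs across boots, the kernel log can be scanned for it. The snippet below is only a sketch: the function name `find_forcewake_timeouts` and the regular expression are illustrative assumptions based on the two lines quoted above, not part of any HAOS or i915 tooling.

```python
import re
from typing import Iterable, List

# Pattern derived from the error line quoted above (an assumption about its
# general shape). The forcewake domain name, e.g. "media", is captured.
FORCEWAKE_RE = re.compile(
    r"\*ERROR\* (?P<domain>\w+): timed out waiting for forcewake ack request"
)


def find_forcewake_timeouts(lines: Iterable[str]) -> List[str]:
    """Return the forcewake domain named in each matching log line, in order."""
    domains = []
    for line in lines:
        match = FORCEWAKE_RE.search(line)
        if match:
            domains.append(match.group("domain"))
    return domains


# The two log lines from this report, used as sample input.
sample = [
    "Sep 29 15:10:23 homeassistant kernel: [drm:fw_domains_get_with_fallback [i915]] "
    "*ERROR* media: timed out waiting for forcewake ack request.",
    "Sep 29 15:10:23 homeassistant kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] "
    "CI tainted:0x9 by fw_domains_get_with_fallback+0x1d3/0x220 [i915]",
]
print(find_forcewake_timeouts(sample))
```

In practice the input would be the saved output of something like `journalctl -k -b -1 --no-pager` (kernel messages from the previous boot), which only works if the journal persists across reboots.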

This seems to indicate an issue related to the i915 integrated GPU. I am using Frigate as an add-on, with (CPU) accelerated video decoding, so it's possibly some interaction between HAOS and Frigate, although I'm using a Coral, rather than GPU, for object detection, so the i915 shouldn't be actively used.

Running Home Assistant OS 10.5 and the latest (2023.9.3) HA core.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

Home Assistant OS 10.5

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

Sadly, wait for about 1-2 weeks on my hardware. I've not been able to force-reproduce the issue.

Anything in the Supervisor logs that might be useful for us?

Sep 29 15:10:23 homeassistant kernel: [drm:fw_domains_get_with_fallback [i915]] *ERROR* media: timed out waiting for forcewake ack request.
Sep 29 15:10:23 homeassistant kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by fw_domains_get_with_fallback+0x1d3/0x220 [i915]

Anything in the Host logs that might be useful for us?

Nothing useful

System information

| Key | Value |
| --- | --- |
| version | core-2023.9.3 |
| installation_type | Home Assistant OS |
| dev | false |
| hassio | true |
| docker | true |
| user | root |
| virtualenv | false |
| python_version | 3.11.5 |
| os_name | Linux |
| os_version | 6.1.45 |
| arch | x86_64 |
| timezone | Europe/London |
| config_dir | /config |

Home Assistant Community Store

| Key | Value |
| --- | --- |
| GitHub API | ok |
| GitHub Content | ok |
| GitHub Web | ok |
| GitHub API Calls Remaining | 5000 |
| Installed Version | 1.33.0 |
| Stage | running |
| Available Repositories | 1297 |
| Downloaded Repositories | 12 |

Home Assistant Cloud

| Key | Value |
| --- | --- |
| logged_in | true |
| subscription_expiration | 16 November 2023 at 00:00 |
| relayer_connected | true |
| relayer_region | eu-central-1 |
| remote_enabled | true |
| remote_connected | true |
| alexa_enabled | false |
| google_enabled | true |
| remote_server | eu-central-1-10.ui.nabu.casa |
| certificate_status | ready |
| can_reach_cert_server | ok |
| can_reach_cloud_auth | ok |
| can_reach_cloud | ok |

Home Assistant Supervisor

| Key | Value |
| --- | --- |
| host_os | Home Assistant OS 10.5 |
| update_channel | stable |
| supervisor_version | supervisor-2023.09.2 |
| agent_version | 1.5.1 |
| docker_version | 23.0.6 |
| disk_total | 916.2 GB |
| disk_used | 32.0 GB |
| healthy | true |
| supported | true |
| board | generic-x86-64 |
| supervisor_api | ok |
| version_api | ok |
| installed_addons | Samba share (10.0.2), MariaDB (2.6.1), Advanced SSH & Web Terminal (15.0.8), ESPHome (2023.9.1), File editor (5.6.0), Frigate (Full Access) (0.12.1), Home Assistant Google Drive Backup (0.111.1), Let's Encrypt (4.12.9), Mosquitto broker (6.3.1), Network UPS Tools (0.12.1), Zigbee2MQTT (1.33.0-1), texecom2mqtt (1.2.3), HDD Tools (1.1.0) |

Dashboards

| Key | Value |
| --- | --- |
| dashboards | 1 |
| resources | 4 |
| views | 12 |
| mode | storage |

Recorder

| Key | Value |
| --- | --- |
| oldest_recorder_run | 25 September 2023 at 10:34 |
| current_recorder_run | 29 September 2023 at 17:13 |
| estimated_db_size | 623.70 MiB |
| database_engine | mysql |
| database_version | 10.6.12 |

Additional information

No response

daernsinstantfortress avatar Sep 29 '23 18:09 daernsinstantfortress

I've since had this issue again, this time repeating the first line of the error, but not the second, so there is some degree of consistency here:

[drm:fw_domains_get_with_fallback [i915]] *ERROR* media: timed out waiting for forcewake ack request.

This failure happened with the Coral moved back to USB from the M.2 A+E slot, so I can rule that change out as a potential cause (it was an outside chance anyway, given the error).

In the spirit of investigation, I've reverted to HAOS 10.4 and will monitor for any change in behaviour. I notice that HAOS 10.5 shipped with a new kernel, so I guess it's possible that this has changed behaviour in some way. Will update this case when I get any results, positive or otherwise.

daernsinstantfortress avatar Oct 01 '23 16:10 daernsinstantfortress

> This seems to indicate an issue related to the i915 integrated GPU. I am using Frigate as an add-on, with (CPU) accelerated video decoding, so it's possibly some interaction between HAOS and Frigate, although I'm using a Coral, rather than GPU, for object detection, so the i915 shouldn't be actively used.

Maybe i915 is used for decoding or encoding video streams? 🤔

> In the spirit of investigation, I've reverted to HAOS 10.4 and will monitor for any change in behaviour. I notice that HAOS 10.5 shipped with a new kernel, so I guess it's possible that this has changed behaviour in some way. Will update this case when I get any results, positive or otherwise.

We usually update the kernel to the latest upstream stable kernel releases. These typically contain bug fixes and shouldn't lead to regressions, so I don't expect that reverting to 10.4 will help here. But let's see 🤞

agners avatar Oct 03 '23 12:10 agners

> Maybe i915 is used for decoding or encoding video streams? 🤔

I don't believe so. My understanding from the Frigate docs is that this is CPU rather than GPU offloaded and that the GPU only gets used when using OpenVINO. I will check, however.

> We usually update the kernel to the latest upstream stable kernel releases. These typically contain bug fixes and shouldn't lead to regressions, so I don't expect that reverting to 10.4 will help here. But let's see 🤞

I've had a good scour around for other people reporting this defect with the 6.1.45 kernel and no such luck, so perhaps this is a red herring (or I'm just the unlucky person to first experience it!)

Either way, it's been rolled back to 10.4 / 6.1.39 since Sunday and nothing's crashed yet, but it's probably way too soon to make any judgements. I continue to monitor...

daernsinstantfortress avatar Oct 03 '23 13:10 daernsinstantfortress

Update: Coming up to 7 days since the rollback to 10.4 and it's been perfectly stable. Will keep it on this and continue to monitor and will report back.

daernsinstantfortress avatar Oct 07 '23 21:10 daernsinstantfortress

...aaaand spoke too soon. Crashed and froze overnight on 10.4, but with no specific errors in the log this time.

In a bid to just restore some stability (or at least not have it sit waiting for a power cycle to restart), one thing I'm pondering is that it's not panicking (HAOS is configured to reboot on panic) but is throwing an "oops", which HAOS is not configured to reboot on:

# sysctl kernel.panic_on_oops
kernel.panic_on_oops = 0

I've since updated /mnt/boot/cmdline.txt as follows: console=tty1 kernel.panic_on_oops=1 ...but this doesn't appear to be applied on reboot, so I need to update manually (sysctl kernel.panic_on_oops=1) which does seem to work. No idea why this entry in /mnt/boot/cmdline.txt isn't taking effect (it appears in /proc/cmdline), but I'm well and truly yak shaving now!
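
One possible explanation, offered as an assumption and not something confirmed in this thread: `kernel.panic_on_oops` is the sysctl name, whereas the kernel's boot-parameter documentation lists `oops=panic` and, from Linux 5.8 onward, the `sysctl.`-prefixed form (`sysctl.kernel.panic_on_oops=1`). The sketch below checks a command line for either form and reads the live sysctl value. The function names are hypothetical, and none of this has been tested on HAOS.

```python
from pathlib import Path
from typing import Optional


def cmdline_requests_panic_on_oops(cmdline: str) -> bool:
    """True if the command line uses a documented spelling for panic-on-oops.

    Assumption: only the two forms from the kernel's boot-parameter docs count;
    a bare `kernel.panic_on_oops=1` token is deliberately not recognised.
    """
    tokens = cmdline.split()
    return "oops=panic" in tokens or "sysctl.kernel.panic_on_oops=1" in tokens


def read_panic_on_oops(path: str = "/proc/sys/kernel/panic_on_oops") -> Optional[int]:
    """Return the current sysctl value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


# The command line described above is not one of the recognised spellings.
print(cmdline_requests_panic_on_oops("console=tty1 kernel.panic_on_oops=1"))
print(cmdline_requests_panic_on_oops("console=tty1 oops=panic"))
```

On the running system the input would be the contents of `/proc/cmdline`, and `read_panic_on_oops()` should report `1` after a reboot if the setting was actually applied.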

daernsinstantfortress avatar Oct 08 '23 19:10 daernsinstantfortress

Upgraded to 11.0 now. No failures so far (since RC), but continuing to monitor. There are some code changes in the latest kernel related to i915, so fingers crossed for a miracle here!

daernsinstantfortress avatar Oct 16 '23 09:10 daernsinstantfortress

It was lovely and stable on 11.0 for two weeks, but I updated to 11.1 last night and have since had an unexpected, unannounced full OS reboot within 24 hours.

Only thing indicating a problem in the journalctl logs was a single line:

-- Boot 00000000000000000000000000000000 --

...which I admit is not very helpful at all!

Unlike the previous issue I had, this one actually rebooted automatically, so the system returned to normal operation. I'll roll back to 11.0 for now and see if I can restore stability. I notice that these updates are running very, very recent Linux kernels (11.1 was running a one-week-old kernel version, 6.1.59). I appreciate that it's good to stay up to date, but does it really need to be this bleeding edge, especially given that kernel bugs are not unknown...?

daernsinstantfortress avatar Oct 30 '23 13:10 daernsinstantfortress

Subscribed to this because I'm having very similar problems. I'm running Docker containers for hass and Frigate on Ubuntu 22.04.3 LTS, kernel 5.15.0-91-generic, with a USB Coral and Intel integrated graphics using the i915 driver. Hard freezes happen anywhere from every few weeks to every few months.

~~I'm trying out installing a newer set of i915 drivers via https://dgpu-docs.intel.com/driver/installation.html~~. <--- Don't do this, I lost video output entirely.

jshank avatar Jan 22 '24 22:01 jshank

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 22 '24 05:04 github-actions[bot]