operating-system
Lockup / Freeze on Optiplex 3060 (i5-8500T) - [i915] media: timed out waiting for forcewake ack request.
Describe the issue you are experiencing
I'm experiencing periodic (random, but approximately every 1-2 weeks) hard lockups of HAOS, running bare-metal on a Dell Optiplex 3060 (i5-8500T) uSFF machine. It's been stable for months, but this has started over the last 6 weeks or so. The system is completely frozen, with no display on the console and no response to keyboard input. It requires a power cycle to recover, which it does cleanly.
The last occurrence reported the following in the logs (via journalctl):
Sep 29 15:10:23 homeassistant kernel: [drm:fw_domains_get_with_fallback [i915]] *ERROR* media: timed out waiting for forcewake ack request.
Sep 29 15:10:23 homeassistant kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by fw_domains_get_with_fallback+0x1d3/0x220 [i915]
This seems to indicate an issue related to the i915 integrated GPU. I am using Frigate as an add-on, with (CPU) accelerated video decoding, so it's possibly some interaction between HAOS and Frigate, although I'm using a Coral, rather than GPU, for object detection, so the i915 shouldn't be actively used.
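To see whether the forcewake error recurs across boots, the kernel log can be piped through a small filter like the one below. This is only a sketch: `count_forcewake_timeouts` is a hypothetical helper name, and it assumes the journal from the crashed boot was persisted to disk.

```shell
#!/bin/sh
# Count i915 forcewake timeout messages in kernel log text read from stdin.
# The match string is copied from the log lines quoted in this report.
count_forcewake_timeouts() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c -F 'timed out waiting for forcewake ack request' || true
}
```

For example, `journalctl -k -b -1 | count_forcewake_timeouts` would count occurrences in the kernel messages of the previous boot.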
Running Home Assistant OS 10.5 and latest (2023-9.3) HA core.
What operating system image do you use?
generic-x86-64 (Generic UEFI capable x86-64 systems)
What version of Home Assistant Operating System is installed?
Home Assistant OS 10.5
Did you upgrade the Operating System.
Yes
Steps to reproduce the issue
Sadly, wait for about 1-2 weeks on my hardware. I've not been able to force-reproduce the issue.
Anything in the Supervisor logs that might be useful for us?
Sep 29 15:10:23 homeassistant kernel: [drm:fw_domains_get_with_fallback [i915]] *ERROR* media: timed out waiting for forcewake ack request.
Sep 29 15:10:23 homeassistant kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by fw_domains_get_with_fallback+0x1d3/0x220 [i915]
Anything in the Host logs that might be useful for us?
Nothing useful
System information
System Information
version | core-2023.9.3 |
---|---|
installation_type | Home Assistant OS |
dev | false |
hassio | true |
docker | true |
user | root |
virtualenv | false |
python_version | 3.11.5 |
os_name | Linux |
os_version | 6.1.45 |
arch | x86_64 |
timezone | Europe/London |
config_dir | /config |
Home Assistant Community Store
GitHub API | ok |
---|---|
GitHub Content | ok |
GitHub Web | ok |
GitHub API Calls Remaining | 5000 |
Installed Version | 1.33.0 |
Stage | running |
Available Repositories | 1297 |
Downloaded Repositories | 12 |
Home Assistant Cloud
logged_in | true |
---|---|
subscription_expiration | 16 November 2023 at 00:00 |
relayer_connected | true |
relayer_region | eu-central-1 |
remote_enabled | true |
remote_connected | true |
alexa_enabled | false |
google_enabled | true |
remote_server | eu-central-1-10.ui.nabu.casa |
certificate_status | ready |
can_reach_cert_server | ok |
can_reach_cloud_auth | ok |
can_reach_cloud | ok |
Home Assistant Supervisor
host_os | Home Assistant OS 10.5 |
---|---|
update_channel | stable |
supervisor_version | supervisor-2023.09.2 |
agent_version | 1.5.1 |
docker_version | 23.0.6 |
disk_total | 916.2 GB |
disk_used | 32.0 GB |
healthy | true |
supported | true |
board | generic-x86-64 |
supervisor_api | ok |
version_api | ok |
installed_addons | Samba share (10.0.2), MariaDB (2.6.1), Advanced SSH & Web Terminal (15.0.8), ESPHome (2023.9.1), File editor (5.6.0), Frigate (Full Access) (0.12.1), Home Assistant Google Drive Backup (0.111.1), Let's Encrypt (4.12.9), Mosquitto broker (6.3.1), Network UPS Tools (0.12.1), Zigbee2MQTT (1.33.0-1), texecom2mqtt (1.2.3), HDD Tools (1.1.0) |
Dashboards
dashboards | 1 |
---|---|
resources | 4 |
views | 12 |
mode | storage |
Recorder
oldest_recorder_run | 25 September 2023 at 10:34 |
---|---|
current_recorder_run | 29 September 2023 at 17:13 |
estimated_db_size | 623.70 MiB |
database_engine | mysql |
database_version | 10.6.12 |
Additional information
No response
I've since had this issue again, this time repeating the first line of the error, but not the second, so there is some degree of consistency here:
[drm:fw_domains_get_with_fallback [i915]] *ERROR* media: timed out waiting for forcewake ack request.
This failed with the Coral moved back to USB from M.2 a/e, so I can rule this change out as a potential cause (it was an outside chance anyway, given the error).
In the spirit of investigation, I've reverted to HAOS 10.4 and will monitor for any change in behaviour. I notice that HAOS 10.5 shipped with a new kernel, so I guess it's possible that this has changed behaviour in some way. Will update this case when I get any results, positive or otherwise.
> This seems to indicate an issue related to the i915 integrated GPU. I am using Frigate as an add-on, with (CPU) accelerated video decoding, so it's possibly some interaction between HAOS and Frigate, although I'm using a Coral, rather than GPU, for object detection, so the i915 shouldn't be actively used.
Maybe i915 is used for decoding or encoding video streams? 🤔
> In the spirit of investigation, I've reverted to HAOS 10.4 and will monitor for any change in behaviour. I notice that HAOS 10.5 shipped with a new kernel, so I guess it's possible that this has changed behaviour in some way. Will update this case when I get any results, positive or otherwise.
We usually update the kernel to the latest upstream stable kernel releases. These typically contain bug fixes and shouldn't lead to regressions, so I don't expect that reverting to 10.4 will help here. But let's see 🤞
> Maybe i915 is used for decoding or encoding video streams? 🤔
I don't believe so. My understanding from the Frigate docs is that this is CPU rather than GPU offloaded and that the GPU only gets used when using OpenVINO. I will check, however.
> We usually update the kernel to the latest upstream stable kernel releases. These typically contain bug fixes and shouldn't lead to regressions, so I don't expect that reverting to 10.4 will help here. But let's see 🤞
I've had a good scour around for other people reporting this defect with the 6.1.45 kernel and found none, so perhaps this is a red herring (or I'm just the unlucky person to first experience it!)
Either way, it's been rolled back to 10.4 / 6.1.39 since Sunday and nothing's crashed yet, but it's probably way too soon to make any judgements. I continue to monitor...
Update: Coming up to 7 days since the rollback to 10.4 and it's been perfectly stable. Will keep it on this and continue to monitor and will report back.
...aaaand spoke too soon. Crashed and froze overnight on 10.4 but with no specific errors in the log this time.
In a bid to just restore some stability (or at least not have it sit waiting for a power cycle to restart), one thing I'm pondering is that it's not panicking (HAOS is configured to reboot on panic) but is throwing an "oops", which HAOS is not configured to reboot on:
# sysctl kernel.panic_on_oops
kernel.panic_on_oops = 0
I've since updated /mnt/boot/cmdline.txt as follows:
console=tty1 kernel.panic_on_oops=1
...but this doesn't appear to be applied on reboot, so I need to update it manually (sysctl kernel.panic_on_oops=1), which does seem to work. No idea why this entry in /mnt/boot/cmdline.txt isn't taking effect (it appears in /proc/cmdline), but I'm well and truly yak shaving now!
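One possible explanation, offered as an assumption and not confirmed anywhere in this thread: the kernel does not treat a bare sysctl name on its command line as a sysctl. The upstream kernel parameter documentation lists `oops=panic` for this purpose, and kernels from 5.8 onwards also accept sysctls given with a `sysctl.` prefix, such as `sysctl.kernel.panic_on_oops=1`. The sketch below checks a command line string for either form; `oops_panic_on_cmdline` is a hypothetical helper name.

```shell
#!/bin/sh
# Check a kernel command line for a panic-on-oops setting in a form the
# kernel parses at boot. Pass the command line as $1; with no argument the
# function reads /proc/cmdline.
oops_panic_on_cmdline() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    for arg in $cmdline; do
        case "$arg" in
            oops=panic|sysctl.kernel.panic_on_oops=1)
                echo "recognized: $arg"
                return 0
                ;;
            kernel.panic_on_oops=*)
                # Bare sysctl name: assumed to be ignored at boot because it
                # lacks the "sysctl." prefix.
                echo "missing sysctl. prefix: $arg"
                ;;
        esac
    done
    return 1
}
```

Running `oops_panic_on_cmdline "$(cat /mnt/boot/cmdline.txt)"` before rebooting would flag the entry shown above, and `sysctl kernel.panic_on_oops` after the reboot shows whether the change took effect.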
Upgraded to 11.0 now. No failures so far (since RC), but continuing to monitor. There are some code changes in the latest kernel related to i915, so fingers crossed for a miracle here!
Was lovely and stable on 11.0 for two weeks, but I updated to 11.1 last night and today had an unexpected full OS reboot, with no warning, within 24 hours.
Only thing indicating a problem in the journalctl logs was a single line:
-- Boot 00000000000000000000000000000000 --
...which I admit is not very helpful at all!
Unlike the previous issue I had, this one actually rebooted automatically so the system returned to normal operation. I'll roll back to 11.0 for now and see if I can restore stability. I notice that these updates are running very, very recent Linux kernels (11.1 was running a one-week old kernel version - 6.1.59) - I appreciate that it's good to stay up to date, but does it really need to be this bleeding edge, especially given that kernel bugs are not unknown...?
Subscribed to this because I'm having very similar problems. I'm running Docker containers for hass and Frigate on Ubuntu 22.04.3 LTS, kernel 5.15.0-91-generic, with a USB Coral and Intel integrated graphics using the i915 driver. Hard freezes happen anywhere from every few weeks to every few months.
~~I'm trying out installing a newer set of i915 drivers via https://dgpu-docs.intel.com/driver/installation.html~~. <--- Don't do this, I lost video output entirely.
There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.