
Supervisor Cannot Handle NFS Mounts with Stale File Handles

Open AngellusMortis opened this issue 1 year ago • 5 comments

Describe the issue you are experiencing

Occasionally (usually at least once a day), my NFS backup mount will just disconnect, and I will get a repair saying it failed. Reloading does not work, and it says to check the logs. The logs say it is a stale file handle. The only way to resolve the issue is to reboot the HAOS host.

I know I am using Unraid, and Unraid has been shown to not have the best performance, but I have 9 machines (7 Linux and 2 Windows) connected to this server via NFS, and Home Assistant is the only one that breaks regularly, so there seems to be something wrong in how the HA Supervisor is managing the NFS mount.

It is rather annoying needing to reboot HA potentially multiple times a day.

What type of installation are you running?

Home Assistant OS

Which operating system are you running on?

Home Assistant Operating System

Steps to reproduce the issue

  1. Setup NFS Mount
  2. ???
  3. It Breaks

Anything in the Supervisor logs that might be useful for us?

Supervisor logs:

24-03-27 08:15:30 INFO (MainThread) [supervisor.mounts.manager] Reloading mount: backup
24-03-27 08:15:30 ERROR (MainThread) [supervisor.mounts.mount] Reloading backup did not succeed. Check host logs for errors from mount or systemd unit mnt-data-supervisor-mounts-backup.mount for details.
24-03-27 08:15:38 INFO (MainThread) [supervisor.mounts.manager] Reloading mount: backup
24-03-27 08:15:38 ERROR (MainThread) [supervisor.mounts.mount] Reloading backup did not succeed. Check host logs for errors from mount or systemd unit mnt-data-supervisor-mounts-backup.mount for details.

Host logs:

Mar 27 12:15:30 home systemd[1]: Reloading Supervisor nfs mount: backup...
Mar 27 12:15:30 home mount[2768611]: mount.nfs: Stale file handle
Mar 27 12:15:30 home systemd[1]: mnt-data-supervisor-mounts-backup.mount: Mount process exited, code=exited, status=1/FAILURE
Mar 27 12:15:30 home systemd[1]: Reload failed for Supervisor nfs mount: backup.
Mar 27 12:15:37 home systemd[1]: run-docker-runtime\x2drunc-moby-ff86bbb670854cdfd1f1814e02525b7fd2c179dc9087c182208732cd9b618704-runc.0pBNcO.mount: Deactivated successfully.
Mar 27 12:15:38 home systemd[1]: Reloading Supervisor nfs mount: backup...
Mar 27 12:15:38 home mount[2769226]: mount.nfs: Stale file handle
Mar 27 12:15:38 home systemd[1]: mnt-data-supervisor-mounts-backup.mount: Mount process exited, code=exited, status=1/FAILURE
Mar 27 12:15:38 home systemd[1]: Reload failed for Supervisor nfs mount: backup.

System Health information

System Information

version core-2024.3.2
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.12.2
os_name Linux
os_version 6.6.20-haos
arch x86_64
timezone America/New_York
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 4885
Installed Version 1.34.0
Stage running
Available Repositories 1406
Downloaded Repositories 21
Home Assistant Cloud
logged_in true
subscription_expiration December 31, 2017 at 7:00 PM
relayer_connected false
relayer_region null
remote_enabled false
remote_connected false
alexa_enabled false
google_enabled false
remote_server null
certificate_status null
instance_id 6d1891b515664ee79d1010b60c891d36
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 12.1
update_channel stable
supervisor_version supervisor-2024.03.0
agent_version 1.6.0
docker_version 24.0.7
disk_total 109.3 GB
disk_used 67.6 GB
healthy true
supported true
board generic-x86-64
supervisor_api ok
version_api ok
installed_addons Studio Code Server (5.15.0), Mosquitto broker (6.4.0), Advanced SSH & Web Terminal (17.2.0), Promtail (2.2.0), Node-RED (17.0.10), Z-Wave JS UI (3.4.1), Github Actions Runner (3), Zigbee2MQTT (1.36.0-1), SQLite Web (4.1.2), Whisper (2.0.0), Piper (1.5.0), Silicon Labs Multiprotocol (2.4.4), Matter Server (5.4.1), openWakeWord (1.10.0), ESPHome (2024.3.1)
Dashboards
dashboards 3
resources 17
views 35
mode yaml
Recorder
oldest_recorder_run March 17, 2024 at 7:48 PM
current_recorder_run March 27, 2024 at 8:23 AM
estimated_db_size 6373.67 MiB
database_engine sqlite
database_version 3.44.2

Supervisor diagnostics

config_entry-hassio-56582080815fc62cb9d95187042d8c10.json

Additional information

No response

AngellusMortis commented Mar 27 '24

> I know I am using Unraid, and Unraid has been shown to not have the best performance, but I have 9 machines (7 Linux and 2 Windows) connected to this server via NFS, and Home Assistant is the only one that breaks regularly, so there seems to be something wrong in how the HA Supervisor is managing the NFS mount.

Are those other Linux systems also constantly connected?

> Occasionally (usually at least once a day), my NFS backup mount will just disconnect, and I will get a repair saying it failed.

I guess this is kinda the root of the problem: The question is why does it disconnect?

Home Assistant is doing a mount manager reload every 15 minutes, which leads to a check of each mount. My guess is that this check fails every now and then, and from that failure onwards the (re-)mount is no longer possible.
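
For context, what the Supervisor triggers on the host is a reload of that mount unit, which systemd carries out as a remount. A rough sketch of what happens under the hood (unit name from your logs, mount path inferred from it; this is my understanding of systemd's behaviour, not the literal Supervisor code):

systemctl reload mnt-data-supervisor-mounts-backup.mount
# which, for a .mount unit, systemd performs as roughly
mount -o remount /mnt/data/supervisor/mounts/backup

A plain remount cannot clear a stale NFS handle, which matches the mount.nfs error in your host logs.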

The StackExchange thread "mount.nfs: Stale file handle error - cannot umount" has some information on why remounting fails in this case.
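
As I understand it, a stale handle can only be cleared client-side by fully unmounting and mounting again from scratch, not by a remount. For reference, the usual manual recovery on a generic Linux box looks roughly like this (mount point inferred from the unit name, server export left as a placeholder):

# force-unmount, falling back to a lazy unmount if the mount point is busy
umount -f /mnt/data/supervisor/mounts/backup || umount -l /mnt/data/supervisor/mounts/backup
# then mount again from scratch
mount -t nfs <server>:<export> /mnt/data/supervisor/mounts/backup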

I guess we can't really do anything client-side. Is the server constantly on?

agners commented Apr 08 '24

Yes. All of the other machines are constantly connected and only HA has the issue. Most of them are constantly using the mount as well.

AngellusMortis commented Apr 08 '24

The mounts are managed by the OS. Here is the fstab entry for them (on the Linux machines). They basically never disconnect or get stale file handles.

ip:/mnt/user/backup /unraid/backup nfs defaults,timeo=900,retrans=5,_netdev 0 0

AngellusMortis commented Apr 08 '24

For Home Assistant network storage, the Supervisor creates a systemd mount unit on Home Assistant OS, which in turn calls the operating system mount commands (in your case the unit name is mnt-data-supervisor-mounts-backup.mount). In the end, this is not much different from an fstab entry.
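
For illustration only (the unit the Supervisor actually generates may use different options; What and Where here are taken from your fstab line and the unit name), such a mount unit looks roughly like this:

[Unit]
Description=Supervisor nfs mount: backup

[Mount]
What=ip:/mnt/user/backup
Where=/mnt/data/supervisor/mounts/backup
Type=nfs
Options=defaults

Conceptually it carries the same information as your fstab entry, just expressed as a unit file.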

The host logs you shared are from the point when you try to reload the mount using the repair, correct?

Is there maybe a log entry from earlier, at the point where it actually breaks? Can you also check whether the Supervisor logs have anything from around the time the breakage happens?

I suspect that the 15-minute mount reload somehow causes havoc in your case.

agners commented Apr 09 '24

There is nothing additional in the host logs; they just say it failed. It looks like the reload logs from the OP are from when the repair saying it failed gets triggered. The Supervisor logs do not seem to go back further than a few minutes, so I doubt I will get the logs from the precise time it fails.

2024-04-12 23:48:45.052 home kernel: audit: type=1334 audit(1712965725.050:486): prog-id=153 op=LOAD
2024-04-12 23:48:45.061 home systemd[1]: Starting Time & Date Service...
2024-04-12 23:48:45.235 home systemd[1]: Started Time & Date Service.
2024-04-12 23:49:15.077 home systemd[1]: systemd-hostnamed.service: Deactivated successfully.
2024-04-12 23:49:15.117 home kernel: audit: type=1334 audit(1712965755.115:487): prog-id=150 op=UNLOAD
2024-04-12 23:49:15.117 home kernel: audit: type=1334 audit(1712965755.115:488): prog-id=149 op=UNLOAD
2024-04-12 23:49:15.117 home kernel: audit: type=1334 audit(1712965755.115:489): prog-id=148 op=UNLOAD
2024-04-12 23:49:15.271 home systemd[1]: systemd-timedated.service: Deactivated successfully.
2024-04-12 23:49:15.279 home kernel: audit: type=1334 audit(1712965755.277:490): prog-id=151 op=UNLOAD
2024-04-12 23:49:15.280 home kernel: audit: type=1334 audit(1712965755.278:491): prog-id=153 op=UNLOAD
2024-04-12 23:49:15.280 home kernel: audit: type=1334 audit(1712965755.278:492): prog-id=152 op=UNLOAD
2024-04-12 23:49:45.326 home systemd[1]: run-docker-runtime\x2drunc-moby-aed93476013bfd982b3b1f514014b9e882e9049b4c6fc64d299ec3dfbf528428-runc.wfFqIe.mount: Deactivated successfully.
2024-04-12 23:51:30.674 home systemd[1]: run-docker-runtime\x2drunc-moby-f49e1580244ed850a42e3254039536671a44479852b57792b29b264a9137a441-runc.Ke9OTz.mount: Deactivated successfully.
2024-04-12 23:54:05.898 home systemd[1]: run-docker-runtime\x2drunc-moby-a83502dd733fdaea40c9a3420078da2be9ec1ac84f17f279f87a3e83e9faaae6-runc.DKvkXi.mount: Deactivated successfully.
2024-04-13 00:23:33.532 home systemd[1]: run-docker-runtime\x2drunc-moby-f49e1580244ed850a42e3254039536671a44479852b57792b29b264a9137a441-runc.Cgon3N.mount: Deactivated successfully.
2024-04-13 00:23:57.576 home systemd[1]: run-docker-runtime\x2drunc-moby-3278db46567e871cfcd6876f170de06748997d57cd5b9d615f687797b363aa87-runc.BpXC5N.mount: Deactivated successfully.
2024-04-13 00:29:38.724 home systemd[1]: run-docker-runtime\x2drunc-moby-3278db46567e871cfcd6876f170de06748997d57cd5b9d615f687797b363aa87-runc.XObDMb.mount: Deactivated successfully.
2024-04-13 00:40:39.058 home systemd[1]: run-docker-runtime\x2drunc-moby-f49e1580244ed850a42e3254039536671a44479852b57792b29b264a9137a441-runc.jGYQVZ.mount: Deactivated successfully.
2024-04-13 00:44:12.925 home systemd[1]: run-docker-runtime\x2drunc-moby-aed93476013bfd982b3b1f514014b9e882e9049b4c6fc64d299ec3dfbf528428-runc.Uz7psY.mount: Deactivated successfully.
2024-04-13 00:53:10.614 home systemd[1]: run-docker-runtime\x2drunc-moby-a83502dd733fdaea40c9a3420078da2be9ec1ac84f17f279f87a3e83e9faaae6-runc.SHnJLU.mount: Deactivated successfully.
2024-04-13 01:09:03.222 home systemd[1]: run-docker-runtime\x2drunc-moby-f3317a65d25dbbf4987cbf55a84b57b1d8ae9c8b8366ce4bb696733ba193809b-runc.q2BxV0.mount: Deactivated successfully.
2024-04-13 01:13:06.464 home systemd[1]: run-docker-runtime\x2drunc-moby-3278db46567e871cfcd6876f170de06748997d57cd5b9d615f687797b363aa87-runc.8DujJZ.mount: Deactivated successfully.
2024-04-13 01:13:10.969 home systemd[1]: run-docker-runtime\x2drunc-moby-781da2ccd41348462c38f5f55cfe364a326f699a56f13c6a3fae50acc15d91e9-runc.oXV3gP.mount: Deactivated successfully.
2024-04-13 01:49:12.468 home systemd[1]: run-docker-runtime\x2drunc-moby-3278db46567e871cfcd6876f170de06748997d57cd5b9d615f687797b363aa87-runc.0OECJK.mount: Deactivated successfully.
2024-04-13 01:55:15.402 home systemd[1]: run-docker-runtime\x2drunc-moby-a83502dd733fdaea40c9a3420078da2be9ec1ac84f17f279f87a3e83e9faaae6-runc.VMm56T.mount: Deactivated successfully.
2024-04-13 01:55:25.340 home kernel: audit: type=1334 audit(1712973325.338:493): prog-id=154 op=LOAD
2024-04-13 01:55:25.340 home kernel: audit: type=1334 audit(1712973325.338:494): prog-id=155 op=LOAD
2024-04-13 01:55:25.340 home kernel: audit: type=1334 audit(1712973325.338:495): prog-id=156 op=LOAD
2024-04-13 01:55:25.346 home systemd[1]: Starting Hostname Service...
2024-04-13 01:55:25.540 home systemd[1]: Started Hostname Service.
2024-04-13 01:55:25.549 home kernel: audit: type=1334 audit(1712973325.547:496): prog-id=157 op=LOAD
2024-04-13 01:55:25.549 home kernel: audit: type=1334 audit(1712973325.547:497): prog-id=158 op=LOAD
2024-04-13 01:55:25.549 home kernel: audit: type=1334 audit(1712973325.547:498): prog-id=159 op=LOAD
2024-04-13 01:55:25.559 home systemd[1]: Starting Time & Date Service...
2024-04-13 01:55:25.759 home systemd[1]: Started Time & Date Service.
2024-04-13 01:55:55.575 home systemd[1]: systemd-hostnamed.service: Deactivated successfully.
2024-04-13 01:55:55.604 home kernel: audit: type=1334 audit(1712973355.602:499): prog-id=156 op=UNLOAD
2024-04-13 01:55:55.604 home kernel: audit: type=1334 audit(1712973355.602:500): prog-id=155 op=UNLOAD
2024-04-13 01:55:55.604 home kernel: audit: type=1334 audit(1712973355.602:501): prog-id=154 op=UNLOAD
2024-04-13 01:55:55.793 home systemd[1]: systemd-timedated.service: Deactivated successfully.
2024-04-13 01:55:55.802 home kernel: audit: type=1334 audit(1712973355.800:502): prog-id=159 op=UNLOAD
2024-04-13 01:55:55.802 home kernel: audit: type=1334 audit(1712973355.800:503): prog-id=158 op=UNLOAD
2024-04-13 01:55:55.802 home kernel: audit: type=1334 audit(1712973355.800:504): prog-id=157 op=UNLOAD
2024-04-13 02:05:07.417 home systemd[1]: run-docker-runtime\x2drunc-moby-f3317a65d25dbbf4987cbf55a84b57b1d8ae9c8b8366ce4bb696733ba193809b-runc.g2zHGn.mount: Deactivated successfully.
2024-04-13 02:10:12.091 home systemd[1]: run-docker-runtime\x2drunc-moby-859d4c404a41d70d1540670bece13ac8099a0e7bd80d1923b87eb9ec4557972b-runc.6Ia0cK.mount: Deactivated successfully.
2024-04-13 02:16:12.527 home systemd[1]: run-docker-runtime\x2drunc-moby-859d4c404a41d70d1540670bece13ac8099a0e7bd80d1923b87eb9ec4557972b-runc.R8UVOV.mount: Deactivated successfully.
2024-04-13 02:22:04.309 home systemd[1]: Reloading Supervisor nfs mount: backup...
2024-04-13 02:22:04.317 home mount[510850]: mount.nfs: Stale file handle for (null) on /mnt/data/supervisor/mounts/backup
2024-04-13 02:22:04.318 home systemd[1]: mnt-data-supervisor-mounts-backup.mount: Mount process exited, code=exited, status=1/FAILURE
2024-04-13 02:22:04.318 home systemd[1]: Reload failed for Supervisor nfs mount: backup.

AngellusMortis commented Apr 13 '24

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented May 13 '24

Not stale

AngellusMortis commented May 13 '24

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented Jun 19 '24