Network Storage: Backups - UX overhaul to address multiple issues

Open jmealo opened this issue 1 year ago • 8 comments

The problem

There are a number of related issues with network storage backups that can cause data loss and inflict great pain on users. This issue will require multiple fixes and improvements to fully address.

Technical issues:

  • Not reconnecting after a power loss when Home Assistant boots before the network server.
  • No backups being shown when the network mount fails to re-establish.
  • No easy way to rectify the failure condition from inside of HA, especially if running inside of a VM.

UX issues:

  • Backups aren’t occurring as configured.
  • The errors and notices that appear are not helpful.
  • Past backups “disappear” when the network mount cannot reconnect.
  • Not enough contextual information in the backups page when there is a network issue.
  • No backup/fallback (local) backups when the network mount fails.

Data loss issues

  • A previously "correctly" configured backup setup stops working without any configuration changes being made; a power cycle is enough.

Opportunities for improvement

  • Offer Cloud-based backup options
  • Offer retries/async uploading of backups to remote storage
  • Offer a fallback/offline storage location in the event that the main/remote storage location fails

GitHub issues:

  • https://github.com/home-assistant/core/issues/103907
  • https://github.com/home-assistant/core/issues/103796
  • https://github.com/home-assistant/core/issues/103652
  • https://github.com/home-assistant/core/issues/102009
  • https://github.com/home-assistant/core/issues/100560
  • https://github.com/home-assistant/core/issues/99551
  • https://github.com/home-assistant/supervisor/issues/4662
  • https://github.com/home-assistant/supervisor/issues/4643
  • https://github.com/home-assistant/supervisor/issues/4577
  • https://github.com/home-assistant/supervisor/issues/4473
  • https://github.com/home-assistant/supervisor/issues/4357
  • https://github.com/home-assistant/supervisor/issues/4358
  • https://github.com/home-assistant/supervisor/issues/4789

What version of Home Assistant Core has the issue?

All

What was the last working version of Home Assistant Core?

Never

What type of installation are you running?

Home Assistant OS

jmealo avatar Dec 27 '23 01:12 jmealo

https://github.com/home-assistant/supervisor/issues/4358#issuecomment-1818009197

This issue basically collects many of the open issues around the root problem, which has been around for months with zero indication that a root-cause fix is being prioritized. Until then, the link above describes how to delete the local files over SSH to empty the directory so it can mount again. Hopefully this saves others from additional searching. Kudos to the original poster, as I always wondered how to get such access remotely on the colored HA boxes.
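Before deleting anything, it can help to confirm that leftover files are actually present. A minimal sketch, assuming the directory in question is passed in by the caller: `stale_files_in_mount_dir` is a hypothetical helper, not part of Supervisor, and `os.path.ismount` may not recognise every bind mount, so treat the result as a hint only.

```python
import os
from pathlib import Path


def stale_files_in_mount_dir(mount_dir: str) -> list[str]:
    """Return names of entries left in an unmounted mount-point directory.

    Hypothetical helper for illustration. An empty list means the
    directory is missing, currently mounted, or already empty.
    """
    path = Path(mount_dir)
    # Nothing to report if the directory does not exist or is mounted.
    if not path.is_dir() or os.path.ismount(path):
        return []
    # Anything listed here lives on the local disk, not on the share.
    return sorted(entry.name for entry in path.iterdir())
```

If the list is non-empty while the share is not mounted, those entries are local files of the kind the linked workaround removes.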

haywiremk avatar Jan 09 '24 20:01 haywiremk

home-assistant/supervisor#4358 (comment)

This issue basically collects many of the open issues around the root problem, which has been around for months with zero indication that a root-cause fix is being prioritized. Until then, the link above describes how to delete the local files over SSH to empty the directory so it can mount again. Hopefully this saves others from additional searching. Kudos to the original poster, as I always wondered how to get such access remotely on the colored HA boxes.

Thanks for posting here. I don't know if there's a one-size-fits-all workaround. I think the exact steps could differ based on where and how you run HA.

I can put together workaround steps for KVM on Ubuntu.

I'm wondering if they have very low feature usage for this and aren't allocating effort based on that?

I offered to help on Discord if I could get someone in core to approve/collaborate on a solution. I didn't hear anything so I tagged everyone who touched this code. Not sure if that's considered poor etiquette in this community.

I'm very curious how code ownership works in these repos, as I haven't found anyone to collaborate with on a fix despite my best efforts.

jmealo avatar Jan 09 '24 22:01 jmealo

I've the same issue as mentioned in https://github.com/home-assistant/supervisor/issues/4358

I run HA on a Raspberry Pi 4 and unfortunately I get stuck trying to get the solution in comment https://github.com/home-assistant/supervisor/issues/4358#issuecomment-1818009197 to work.

I would be very grateful if the HA core team would take some time to get rid of the backup bugs.

IvovanWilligen avatar Jan 10 '24 21:01 IvovanWilligen

Here's a proposed fix that I believe will address these issues: https://github.com/home-assistant/supervisor/issues/4856

jmealo avatar Jan 30 '24 20:01 jmealo

Architecture discussion on a proposed fix for the Network Storage issues: https://github.com/home-assistant/architecture/discussions/1033

jmealo avatar Jan 30 '24 21:01 jmealo

@jmealo first of all, thanks for collecting all these issues and taking these notes :pray:

Things are a bit sprinkled all over the place now :sweat_smile: The network mount is exclusively a Supervisor feature, so this discussion belongs in the Supervisor repository. As a first step, I've moved this issue, which used to be in the Core repository, over here into the Supervisor repository.

Architecture discussion on a proposed fix for the Network Storage issues: https://github.com/home-assistant/architecture/discussions/1033

This would probably also belong in here. E.g. the overall design of the mount feature was not discussed in the architecture repository. Most discussions were in #2564, and some discussions are not captured on GitHub as they happened on Discord or in other places.

I'd suggest using this issue tracker, specifically this very issue, to further discuss how we proceed with network storage.

agners avatar Feb 01 '24 13:02 agners

A bit of background: The network storage feature makes use of systemd mounts. Essentially, the Supervisor instructs systemd running on the operating system to create mounts using D-Bus. The mounts are not persisted on the OS side. This means on reboot the Supervisor instructs systemd to recreate those mounts.

Furthermore, we use a 2 stage system: We mount a network storage internally to a common place, and then bind mount it to the actual place.

It seems we have life cycle problems, especially around appearing/disappearing network storage systems. Technically, we should be able to get notified about a failing mount from systemd via D-Bus. However, some of these cases are just not captured even by the OS. E.g. when a NAS disappears and nothing is being written to it, the system might just not notice... until something actually gets written to it! The question becomes what happens in this case. I guess whoever writes at that point will get an error on their write system call. What I wonder is whether the systemd mount unit also fails :thinking: This needs a bit of investigation.
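One way to surface a silently dead share is to force a small write and see whether the OS reports an error. The following is only a sketch of that idea: `mount_accepts_writes` is a hypothetical name, the path is supplied by the caller, and this is not how the Supervisor currently checks mounts.

```python
import os
import tempfile


def mount_accepts_writes(path: str) -> bool:
    """Try to write and flush a small temporary file under `path`.

    Returns False if the OS reports an error. Caveat: on a hard-mounted
    NFS share the write may block instead of failing, so a real check
    would likely need a timeout around this call.
    """
    try:
        # The probe file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".probe-") as probe:
            probe.write(b"ok")
            probe.flush()
            os.fsync(probe.fileno())  # push the data past the page cache
        return True
    except OSError:
        return False
```

A probe like this only answers whether a write fails right now; whether the systemd mount unit changes state at the same time is still the open question above.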

Conceptually, I'd say the system should behave as follows:

a) Supervisor should notify the user about any failed network mount, whether it fails at startup or at any later point. The repair should then reliably mount the storage again.

b) If a backup is triggered with a target location that is supposed to be a network mount (but failed to mount), the user should be notified. It is a bit of a debate whether we should still create a backup then, just on local storage :thinking: Having a backup is better than none. On the other hand, this obviously has a huge potential to fill the disk.
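To make the trade-off in b) concrete, here is a rough sketch of one possible policy: fall back to local storage only while enough free space would remain, and notify the user whenever the network target was not used. All names and the `min_free_after` threshold are assumptions for illustration; the desired behavior still has to be defined.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WRITE_TO_NETWORK = "network"
    WRITE_TO_LOCAL = "local"
    SKIP = "skip"


@dataclass(frozen=True)
class Decision:
    action: Action
    notify_user: bool


def decide_backup_target(
    network_mounted: bool,
    backup_size: int,
    local_free: int,
    min_free_after: int,
) -> Decision:
    """Pick a backup target; all sizes are in bytes (hypothetical policy)."""
    if network_mounted:
        return Decision(Action.WRITE_TO_NETWORK, notify_user=False)
    # Fall back to local storage only if it would not fill the disk.
    if local_free - backup_size >= min_free_after:
        return Decision(Action.WRITE_TO_LOCAL, notify_user=True)
    # Neither target is safe: skip the backup, but still tell the user.
    return Decision(Action.SKIP, notify_user=True)
```

With this shape, the user is notified in every case where the mount failed, and the disk-filling risk is bounded by the chosen threshold.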

For a), I think there is mainly one issue, which is https://github.com/home-assistant/supervisor/issues/4358.

For b), we probably should define what behavior we exactly want, and implement this accordingly.

I also think that some of the issues above are probably no longer valid. E.g. 2024.01.0 improved backup error handling, and 2023.12.0 included https://github.com/home-assistant/supervisor/pull/4733, which fixed a problem during unmounting of network storage.

It would be helpful if we could collect the issues which have the same underlying problem. It would also be nice to have step-by-step instructions for reproducing those underlying problems, so we can reproduce them and work on a fix.

agners avatar Feb 01 '24 14:02 agners

@agners Thanks for the thoughtful reply. We agree on the "non-empty directory mount issue" being the primary issue at play here.

Where would be the appropriate place to put automated tests where we can set up a flaky NFS and CIFS test suite? I could start looking into that so we can better understand the issues.
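As a starting point for discussion, a flaky share can be simulated in memory before involving real NFS or CIFS servers. This is only a sketch with hypothetical names (`FlakyStorage`, `write_with_retries`); it does not exercise systemd or an actual mount, which a real suite would still need.

```python
class FlakyStorage:
    """In-memory stand-in for a network share that fails the first N writes."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success
        self.files: dict[str, bytes] = {}

    def write(self, name: str, data: bytes) -> None:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise OSError("simulated: share unreachable")
        self.files[name] = data


def write_with_retries(
    storage: FlakyStorage, name: str, data: bytes, attempts: int
) -> bool:
    """Retry a write up to `attempts` times; True once it succeeds."""
    for _ in range(attempts):
        try:
            storage.write(name, data)
            return True
        except OSError:
            continue  # a real implementation would back off here
    return False
```

A test can then assert both outcomes: a share that recovers within the retry budget ends up holding the file, and one that stays down makes the write report failure.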

jmealo avatar Feb 01 '24 21:02 jmealo

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 02 '24 22:03 github-actions[bot]

Why is this closed? It should be addressed properly! It's still THE problem why I don't have off-device backups at the moment.

IvovanWilligen avatar Mar 10 '24 06:03 IvovanWilligen

Why is this closed?

Because it went stale, just read the comment of the github-actions bot.

It should be addressed properly! It's still THE problem why I don't have off-device backups at the moment.

This issue is a collection of issues (which is probably a bit problematic in itself :sweat: ). What exact problem are you still experiencing?

agners avatar Mar 11 '24 09:03 agners