python-matter-server icon indicating copy to clipboard operation
python-matter-server copied to clipboard

CHIP Persistent data storage writes are not power-loss safe

Open kepstin opened this issue 1 year ago • 12 comments

In this discord thread: https://discord.com/channels/330944238910963714/1311124568081174628 a user reported an issue where following a power loss, all of their matter devices disappeared. From the Matter server logs:

2024-11-26 20:12:41.236 (MainThread) ERROR [chip.storage] Expecting ',' delimiter: line 29 column 111 (char 4111)
2024-11-26 20:12:41.236 (MainThread) CRITICAL [chip.storage] Could not load configuration from /data/chip.json - resetting configuration...

It looks like the configuration file was present, but had invalid data in it. The user was able to restore a backup correct the issue, but when I looked at the code to persist the storage: (edit: this code doesn't seem to be responsible for the user's issue, which happened in the Matter SDK code - but this code is incorrect.)

https://github.com/home-assistant-libs/python-matter-server/blob/a55bb8b0dc838cfee2eb0dd70f18655e26540efb/matter_server/server/storage.py#L149-L163

I noticed that the logic here is not power loss safe, and might result in a partially written or missing file for unlucky users.

The current logic does:

  • Rename chip.json to chip.json.backup. Following this, if power is lost, there may be no chip.json file present at all.
  • Write a new chip.json file. If the write is interrupted, or power is lost before the system flushes the write cache, a partially written or corrupt configuration file may be present on reboot, resulting in errors like this user saw.

I'd recommend the following logic instead:

  • (optional) Copy chip.json to chip.json.backup, leaving the existing file in place. (Ideally, run fsync on the newly written file to ensure the backup is updated before the original file gets replaced.)
  • Write the new data to a temporary file, like chip.json.new.
  • Run fsync (or fdatasync) on the newly written file to ensure the file contents are persisted on the disk. (see the notes on https://docs.python.org/3/library/os.html#os.fsync regarding flushing the python buffers). This is needed to ensure ordering - making sure that the file contents will be written before the file gets renamed.
  • Rename the temporary file to the final filename, atomically replacing the existing file.
  • (optional) Run fsync on the renamed file to ensure the rename is committed to storage. (This is only needed if data stored elsewhere - e.g. in a sqlite database - might depend on this data having been persisted)

kepstin avatar Nov 27 '24 15:11 kepstin

… Actually I think I'm looking at the wrong code here - the error messages don't match up. Is the code in matter-server/server/storage.py used? is it for something else?

kepstin avatar Nov 27 '24 16:11 kepstin

Looking closer, the actual code responsible for this might be in the Matter SDK itself, here:

https://github.com/project-chip/connectedhomeip/blob/e98f2b239aae657a10043bad9aa61f4717d059be/src/controller/python/chip/storage/init.py#L164-L183

Unfortunately, this code has the same issue, where a power loss during write (prior to the fsync) can result in an incompletely written file which will be corrupt on restart. It's arguably worse in a different way, since it doesn't remove/truncate the file before doing a write, so on a power loss you can get a mix of the old and new files present.

I've opened https://github.com/project-chip/connectedhomeip/issues/36654 regarding the Matter SDK code.

kepstin avatar Nov 27 '24 16:11 kepstin

The only way to get that resolved is create a PR on the matter SDK repo. Nobody is going to pick up that issue report as the python bindings are never meant to be used the way we do. They're only created for the testing framework. So if you wnat this fixed, you'll need to create a PR there or wait for one of us to have some spare time ;-)

Even though this could be improved I still think you just need to make backups of your server.

marcelveldt avatar Dec 04 '24 08:12 marcelveldt

Yes, backups are always needed, but if it can be easily fixed/improved, why not?

nerijus avatar Dec 04 '24 08:12 nerijus

Files getting corrupted on hard power off is entirely preventable in software (assuming you're not running on broken hardware). We should be expecting Home Assistant will be able to automatically recover and keep working when booted up after a hard power off. The current situation, where it can hit a corrupt data file and drop all your Matter devices with nothing but a log message is unacceptable IMO.

Fixing this will prevent an entire class of "oh no, there was a power outage last night and none of my automations on matter devices worked this morning - what are you saying i need to restore a backup!?" issues.

Unfortunately while I have the ability to write and contribute the fix to the Matter SDK side, I'm not willing to sign their CLA at the moment.

kepstin avatar Dec 16 '24 16:12 kepstin

I'm not saying we should not fix this (we're looking at it) but that besides making the software more forgiving you should always have backups as well. Sorry if that didn't come across. English is not my main language so my Dutch-style English is not always understood haha.

We are a super limited team (basically its just me and Stefan helps me out now and then) and low on contributions so we don't have lots of time to spend coding, its a constant balance of what to pickup and our time has very much been consumed by all these issues with Thread and bugs with TREL.

I'll do my best to create at least a temporary patch for this (as we now have seen a couple of reports already about this) before/on monday. Thanks!

marcelveldt avatar Dec 20 '24 17:12 marcelveldt

I think I just hit this issue myself. Does the Docker container contain this fix?

I'm currently running (f07a9b2ff8bafeaf1be20ded7f4c2ff3452fe7b87d9c22aa72beaf980978a07b):

ghcr.io/home-assistant-libs/python-matter-server     stable             f07a9b2ff8ba   2 days ago      448MB

aidanharris avatar Dec 26 '24 07:12 aidanharris

docker image inspect f07a9b2ff8bafeaf1be20ded7f4c2ff3452fe7b87d9c22aa72beaf980978a07b

        "RepoDigests": [
            "ghcr.io/home-assistant-libs/python-matter-server@sha256:1e371d6936c179fec67896180a8697448c2ccec628078f709bb4b271b040701b"
        ],

It seems up-to-date: https://github.com/home-assistant-libs/python-matter-server/actions/runs/12470272943/job/34805323062

I'm not a fan of how difficult Github makes it to find the built images with their hash, etc. Should I be looking somewhere else?

All my Matter devices stopped responding for some reason. I have it working again now though. I still have the backed up data folder if it's useful. I'll open another bug if it happens again.

aidanharris avatar Dec 26 '24 07:12 aidanharris

I think I just hit this issue myself. Does the Docker container contain this fix?

Unfortunately, the fix that has been committed isn't complete - it improves the situation, but there's an important data file written by the Matter SDK itself which also has this issue: https://github.com/project-chip/connectedhomeip/issues/36654 (in logs I've seen for other users hitting this problem, the Matter SDK file is the one that actually got corrupted).

kepstin avatar Dec 26 '24 13:12 kepstin

I think there's still a bug in the server. It's not making the JSON file, I had to touch an empty file:

Traceback (most recent call last):
  File "/usr/local/bin/matter-server", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/site-packages/matter_server/server/__main__.py", line 239, in main
    run(server.start(), shutdown_callback=handle_stop)
  File "/usr/local/lib/python3.12/site-packages/aiorun.py", line 370, in run
    raise pending_exception_to_raise
  File "/usr/local/lib/python3.12/site-packages/aiorun.py", line 270, in run
    loop.run_until_complete(shutdown_callback(loop))
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/matter_server/server/__main__.py", line 236, in handle_stop
    await server.stop()
  File "/usr/local/lib/python3.12/site-packages/matter_server/server/server.py", line 225, in stop
    await self.storage.stop()
  File "/usr/local/lib/python3.12/site-packages/matter_server/server/storage.py", line 55, in stop
    await self.async_save()
  File "/usr/local/lib/python3.12/site-packages/matter_server/server/storage.py", line 169, in async_save
    await self.server.loop.run_in_executor(None, do_save)
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/matter_server/server/storage.py", line 159, in do_save
    shutil.copy(self.filename, self.filename_backup)
  File "/usr/local/lib/python3.12/shutil.py", line 435, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/local/lib/python3.12/shutil.py", line 260, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/15481557425402580959.json'
# touch matter-data/15481557425402580959.json
# Now it persists properly when I `docker stop/start`

aidanharris avatar Dec 31 '24 10:12 aidanharris

@aidanharris please do not pollute this issue report with an unrelated report. The issue you mention has been fixed this morning and a new release will be issued soon.

marcelveldt avatar Dec 31 '24 10:12 marcelveldt

@aidanharris please do not pollute this issue report with an unrelated report. The issue you mention has been fixed this morning and a new release will be issued soon.

Sorry, about that. I did mark it as resolved (I should have checked first) but that's too late for people's email notifications.

aidanharris avatar Dec 31 '24 10:12 aidanharris

Lets revisit this when we migrated to matter.js. All storage interactions there use writes and atomic renames

Apollon77 avatar Aug 28 '25 08:08 Apollon77