
Detected data corruption while saving blob - hash mismatch

AnAnalogGuy opened this issue 7 months ago · 13 comments

Output of restic version

restic 0.18.0 compiled with go1.24.1 on linux/amd64

What backend/service did you use to store the repository?

sftp; restic is running in a VM on Debian

Problem description / Steps to reproduce

restic -r sftp:user.server.com: backup /srv/mergerfs/media --exclude-file=/excludes.txt --password-file /pwd.txt

  • source: a local media share/directory using mergerfs, consisting of an NVMe and an SSD disk, both using ext4 on LUKS
  • target: an sftp service in the cloud
  • restic runs in a bash script called by crontab
  • the media dir holds about 2.75 TB, mainly videos

Expected behavior

restic completes without an error

Actual behavior

restic stops and returns:

Fatal: unable to save snapshot: failed to save blob from file "/srv/mergerfs/media/abc.mgp": **Detected data corruption while saving blob** c812c50912fbef3addd75979c4b17969a06eaecb995f1417af0199fe2209c97f: **hash mismatch**
Corrupted blobs are either caused by hardware issues or software bugs. Please open an issue at https://github.com/restic/restic/issues/new/choose for further troubleshooting.

Do you have any idea what may have caused this?

  • I updated to 0.18.0 just before this issue happened.
  • I had "issues" with very long runtimes for no obvious reason before, which is why I updated from 0.14.0 (official Debian repo) to 0.18.0 from GitHub. But there were no errors or failures before.
  • Other restic jobs on the same machine reading from the same hardware (SSD) work fine. However, /media is the only one using mergerfs.

Did restic help you today? Did it make you happy in any way?

  • It works well for smaller file sizes and smaller numbers of files.
  • It does not seem to work efficiently on larger repos and larger files, and the behaviour is inconsistent in these situations: without any changes to the source files, a run completes within seconds; with only small changes of a couple of MB, it takes multiple hours to complete.
  • I like the integrated dedup and the encryption.
  • I'm missing an admin UI, especially to visualize (a) the state and content of a repo and (b) what is actually happening in restic during a run.

AnAnalogGuy avatar Jun 07 '25 12:06 AnAnalogGuy

Just started a rerun. The same error happened again, however on a different source file. I also checked the sftp backend: no storage issues as far as I can see. Starting a different restic job with different source files against the same target server (different repo) works.

AnAnalogGuy avatar Jun 07 '25 12:06 AnAnalogGuy

No indication, but something to keep in mind for further investigations:

  • mergerfs spans an NVMe and an SSD (temperatures are no issue in either case)
  • In both cases above the data came from the NVMe
  • No other processes write in parallel while restic reads from the mergerfs pool
  • Both the NVMe and the SSD are almost full. If I understand correctly, this should have no effect, as restic only writes to ~/.cache/restic/ and /tmp on the OS disk (where there's plenty of free space)?

AnAnalogGuy avatar Jun 07 '25 12:06 AnAnalogGuy

I ran sha256sum a couple of times on the file restic complained about above, and it returned the same hash each time.
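That check can also be scripted, hashing in streaming fashion so multi-GB media files don't need to fit in RAM. A minimal sketch (the temp file is only a stand-in for the real path on the mergerfs mount):

```python
import hashlib
import os
import tempfile

def sha256_file(path: str, bufsize: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB buffers."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

# Demo on a temp file; in practice, point this at the file restic
# complained about and compare digests across several runs.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"sample media data" * 1000)
    path = f.name

digests = {sha256_file(path) for _ in range(5)}
assert len(digests) == 1  # stable reads: storage returned identical bytes
os.unlink(path)
```

Differing digests across runs would point at flaky storage or RAM rather than at restic.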

Now, looking a bit closer at restic and mergerfs: it seems mergerfs uses virtual inodes, while restic expects stable inodes. Is this correct? If yes, couldn't this explain the error message? That is, the hashes do not match because restic, relying on a stable inode, picks a "wrong" inode and therefore reads different content, leading to a different hash?

I'm going to try restic backup --ignore-inode
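The inode hypothesis can at least be probed from the source side: check whether the inode number a path reports stays stable across repeated stat() calls. A minimal sketch (the temp file is a stand-in for a file on the mergerfs mount; whether restic's parent-snapshot comparison is actually the failing piece here is the hypothesis under test, not an established fact):

```python
import os
import tempfile

def inode(path: str) -> int:
    """Return the inode number the filesystem reports for a path."""
    return os.stat(path).st_ino

# On a plain local filesystem the inode must be stable; a union
# filesystem like mergerfs could, in theory, report varying virtual
# inodes, which would defeat inode-based change detection.
with tempfile.NamedTemporaryFile() as f:
    inodes = {inode(f.name) for _ in range(10)}

assert len(inodes) == 1  # stable inode across repeated stat() calls
```

If the set ever contains more than one value for a mergerfs path, `--ignore-inode` (which restic offers exactly for such filesystems) is the right workaround.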

AnAnalogGuy avatar Jun 07 '25 12:06 AnAnalogGuy

--ignore-inode seems to do the trick. I ran the job twice now: no issues, and it returned within seconds, as expected. I will do some more tests. However, given the disadvantages of ignoring inodes regarding dedup, CPU load and runtime, I will probably change the restic job to back up the disks directly instead.

AnAnalogGuy avatar Jun 07 '25 13:06 AnAnalogGuy

Ummm.. no. The problem is still there. When I run

  • without --ignore-inode
  • directly on the disks, not using mergerfs

I get the same error, again on a different file.

AnAnalogGuy avatar Jun 07 '25 14:06 AnAnalogGuy

After restic cache --cleanup and a rerun: same error, different file.

AnAnalogGuy avatar Jun 07 '25 14:06 AnAnalogGuy

Next test: running two separate restic jobs, one per drive, with the same parameters and the same target repo, just different sources. The SSD runs without problems, roughly 1 TB done in 1 h. The NVMe runs into the same error as above, this time rather towards the end of the task (somewhere between 80-90% of 1.75 TB).

All issues so far have been related to the NVMe disk. However, I have the impression this could also stem from an earlier failure situation involving incorrect dynamic inodes.

Next stop: delete the whole repo and start from scratch.

AnAnalogGuy avatar Jun 07 '25 16:06 AnAnalogGuy

What's the maximum chunk size restic loads into /tmp? Is there a hard limit, or is it relative to the maximum file size in the backup?
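For context: as far as I know, restic does not spool whole files anywhere but splits them into content-defined chunks (the documented bounds are 512 KiB minimum and 8 MiB maximum per chunk), so space requirements do not scale with the largest file. A toy chunker illustrating min/max-bounded content-defined cutting (the tiny bounds and the shift-xor rolling hash are made up for the example; restic's real chunker uses a Rabin fingerprint):

```python
import hashlib

# Toy bounds so the example runs on small data; restic uses
# 512 KiB min / 8 MiB max.
MIN_CHUNK, MAX_CHUNK, MASK = 64, 256, 0x1F  # cut when hash & MASK == 0

def chunk(data: bytes) -> list[bytes]:
    """Cut data at content-defined boundaries, bounded by MIN/MAX sizes."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling = ((rolling << 1) ^ b) & 0xFFFFFFFF  # toy rolling hash
        size = i - start + 1
        if (size >= MIN_CHUNK and rolling & MASK == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks

data = bytes(range(256)) * 8  # 2 KiB of sample data
parts = chunk(data)
assert b"".join(parts) == data                  # chunking is lossless
assert max(len(p) for p in parts) <= MAX_CHUNK  # hard upper bound per chunk
print(len(parts), "chunks,", [hashlib.sha256(p).hexdigest()[:8] for p in parts[:3]])
```

The upshot: because each chunk is capped, per-chunk memory is bounded regardless of file size, which matches the observation that /tmp never filled up.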

--

Doubled /tmp and monitored it during the backup run; it never came near its maximum size. However, I again ran into the same error, same NVMe, different file. This time at about 50% of the backup completed.

AnAnalogGuy avatar Jun 07 '25 18:06 AnAnalogGuy

Sounds somewhat similar: https://github.com/restic/restic/issues/5279

AnAnalogGuy avatar Jun 07 '25 22:06 AnAnalogGuy

Deleted the repo, initialized a new one; the first full backup run is in progress. ETA 60-70 h for 2.75 TB. So this will take some time if no error occurs.
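That ETA implies a sustained throughput of roughly 12 MB/s, which would be plausible for an sftp uplink bottleneck (an assumption on my part; the run could also be CPU- or chunking-bound):

```python
# Back-of-the-envelope throughput implied by the ETA.
total_bytes = 2.75e12        # 2.75 TB to transfer
hours = 65                   # midpoint of the 60-70 h estimate
rate = total_bytes / (hours * 3600)   # bytes per second
print(f"{rate / 1e6:.1f} MB/s")       # ≈ 11.8 MB/s
```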

AnAnalogGuy avatar Jun 08 '25 12:06 AnAnalogGuy

on a different source file

That's a good reason to run an mprime torture test to make sure your CPU and RAM aren't broken.

andreymal avatar Jun 09 '25 09:06 andreymal

I did.

Atm, nothing points towards hardware. There are weak indications towards the software: there is a clear correlation with switching from 0.14 to 0.18. However, correlation does not mean root cause. Still, it could be an incompatibility in restic between versions 0.14 and 0.18. The inode handling also seems to play a role. So maybe it's a more complex error situation.

A completely new backup on a new repo is running atm, 38% done and no errors so far.

AnAnalogGuy avatar Jun 09 '25 12:06 AnAnalogGuy

Recap

  • I deleted the old repo, which was created with 0.14
  • I set up a new repo with 0.18
  • I did a first full backup run
  • I did not use --ignore-inode
  • I did not back up via mergerfs but directly from the disks via their GUIDs

Result

  • The run completed without errors
  • I did a couple of reruns without any data changed on the source drives, and each completed fast and without errors.

Conclusion

  • No root cause found so far
  • No indication of a hardware problem
  • Correlation with the update from 0.14 to 0.18
  • My current interpretation: the repo created with 0.14 either had an error not detected by 0.14 but detected by 0.18, or 0.18 isn't fully compatible with 0.14. Since I have a couple more, far smaller repos created with 0.14 and now running with 0.18 that work without problems, repo and/or file size might play a role here as well.

AnAnalogGuy avatar Jun 11 '25 13:06 AnAnalogGuy

The "Corrupted blobs are either caused by hardware issues or software bugs." error CANNOT in any way be affected by the state of the existing repository. This error message simply didn't exist in older restic versions (that's the correlation with the version upgrade). What restic does is compress+encrypt a data chunk in memory and then decrypt+decompress it afterwards. This whole operation happens in memory, that is, it cannot be directly affected by the filesystem in use.

As the error seems to randomly occur and disappear, this usually indicates some (minimal) hardware issue. The filesystem and the options passed to restic play a role here only insofar as they change the load on the hardware. Problems may occur during low or high load, or only when switching between them.
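The in-memory self-check described above can be sketched like this (zlib stands in for restic's actual zstd + AES pipeline; this illustrates the principle, it is not restic's code):

```python
import hashlib
import zlib

def save_blob(plaintext: bytes) -> bytes:
    """Hash the chunk, transform it, undo the transform entirely in
    memory, and verify the hash still matches before uploading."""
    want = hashlib.sha256(plaintext).digest()
    stored = zlib.compress(plaintext)     # stand-in for compress + encrypt
    roundtrip = zlib.decompress(stored)   # stand-in for decrypt + decompress
    if hashlib.sha256(roundtrip).digest() != want:
        # Only flaky RAM/CPU (or a software bug) can make this fire:
        # no filesystem or repository state is involved in the check.
        raise RuntimeError(
            "Detected data corruption while saving blob: hash mismatch")
    return stored

blob = save_blob(b"a chunk of media data" * 100)
assert zlib.decompress(blob) == b"a chunk of media data" * 100
```

Since both hashes are computed over buffers that never leave RAM, a mismatch means the bytes changed between the two hash computations, which is exactly why the error points at hardware (or a restic bug), never at the repo.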

Please run check --read-data to verify the repository integrity (warning: this will download the whole repository).

MichaelEischer avatar Sep 07 '25 13:09 MichaelEischer