Detected data corruption while saving blob - hash mismatch
Output of restic version
restic 0.18.0 compiled with go1.24.1 on linux/amd64
What backend/service did you use to store the repository?
sftp. restic is running in a VM on Debian.
Problem description / Steps to reproduce
restic -r sftp:user.server.com: backup /srv/mergerfs/media --exclude-file=/excludes.txt --password-file /pwd.txt
- source: a local media share/directory using mergerfs, consisting of an NVMe and an SSD disk, both using ext4 on LUKS.
- target: an SFTP service in the cloud
- running restic in a bash script called by crontab (a minimal sketch of the wrapper follows this list)
- the media dir holds about 2.75 TB, mainly videos
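For context, the cron setup is essentially a thin wrapper around the command above. A rough sketch (the script path and schedule are placeholders, not my exact setup):

```bash
#!/usr/bin/env bash
# /usr/local/bin/restic-media-backup.sh (placeholder name) - called by crontab
set -euo pipefail

restic -r sftp:user.server.com: backup /srv/mergerfs/media \
    --exclude-file=/excludes.txt \
    --password-file /pwd.txt

# crontab entry (example schedule, runs nightly at 02:00):
# 0 2 * * * /usr/local/bin/restic-media-backup.sh
```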
Expected behavior
restic completes without an error
Actual behavior
restic stops and returns:
Fatal: unable to save snapshot: failed to save blob from file "/srv/mergerfs/media/abc.mgp": **Detected data corruption while saving blob** c812c50912fbef3addd75979c4b17969a06eaecb995f1417af0199fe2209c97f: **hash mismatch**
Corrupted blobs are either caused by hardware issues or software bugs. Please open an issue at https://github.com/restic/restic/issues/new/choose for further troubleshooting.
Do you have any idea what may have caused this?
- updated to 0.18.0 just before this issue happened.
- had "issues" with very long runtimes for no obvious reason before, which is why I updated from 0.14.0 (official Debian repo) to 0.18.0 from GitHub. But there were no errors or failures before.
- Other restic jobs on the same machine reading from the same hardware (SSD) work fine. However, the media job is the only one using mergerfs.
Did restic help you today? Did it make you happy in any way?
- It works well for smaller file sizes and smaller number of files.
- It does not seem to work efficiently on larger repos and larger files. The behaviour seems inconsistent in these situations: without any changes to the source files, it completes within seconds; with only small changes of a couple of MB, it takes multiple hours to complete.
- I like the integrated dedup and the encryption.
- I'm missing an admin UI, especially to visualize (a) the state and content of a repo and (b) what actually is happening in restic during a run
Just started a rerun. The same error happened again, however on a different source file. I also checked the SFTP backend: no storage issues as far as I can see. Starting a different restic job with different source files but the same target server (different repo) works.
No indication, but something to keep in mind for further investigations:
- mergerfs spans an NVMe and an SSD (temperatures are no issue in either case)
- In both cases above the data came from the NVMe
- No other writing processes happen in parallel to restic reading from the mergerfs mount
- Both the NVMe and the SSD are almost full. If I understand correctly, this should have no effect, as restic only writes to ~/.cache/restic/ and /tmp on the OS disk (where there's plenty of free space available)?
I ran sha256sum a couple of times on the file restic complained about above, and it returned the same hash each time.
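Roughly the kind of check I ran (a sketch using standard coreutils; the path is the one from the error message above):

```bash
# Hash the file restic complained about several times; the output should be
# identical on every pass if the source data is stable.
for i in 1 2 3; do
    sha256sum /srv/mergerfs/media/abc.mgp
done
```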
Now, looking a bit closer at restic and mergerfs: it seems mergerfs uses virtual inodes while restic expects stable inodes. Is this correct? If yes, couldn't this explain the error message? Meaning, the hashes do not match because restic, relying on a stable inode, picks a "wrong" inode and therefore gets different content, leading to a different hash?
I'm going to try restic backup --ignore-inode
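For reference, the invocation would look roughly like this (same repository, source path and option files as the original job above, just with the flag added):

```bash
restic -r sftp:user.server.com: backup /srv/mergerfs/media \
    --exclude-file=/excludes.txt \
    --password-file /pwd.txt \
    --ignore-inode
```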
--ignore-inode seems to do the trick. I ran the job two times now: no issues, and it returned within seconds, as expected. I will do some more tests. However, given the disadvantages of ignoring inodes regarding dedup, CPU load and runtime, I will probably rather change the restic job to back up the disks directly.
Ummm... no. The problem is still there. When I run
- without --ignore-inode
- directly on the disks, not using mergerfs
I get the same error, again on a different file.
I ran restic cache --cleanup and then reran the job: same error, different file.
Next test: running two separate restic jobs, one for each of the two drives, with the same parameters and the same target repo, just different sources. The SSD job runs without problems, about 1 TB done in roughly 1 hour. The NVMe job runs into the same error as above, this time rather at the end of the task (somewhere between 80-90% of 1.75 TB).
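The two jobs look roughly like this (the mount points are placeholders for the actual per-disk paths; repository and option files as above):

```bash
# SSD job - completed without problems (~1 TB in roughly 1 hour)
restic -r sftp:user.server.com: backup /mnt/ssd \
    --exclude-file=/excludes.txt --password-file /pwd.txt

# NVMe job - failed with the same hash mismatch, at roughly 80-90% of 1.75 TB
restic -r sftp:user.server.com: backup /mnt/nvme \
    --exclude-file=/excludes.txt --password-file /pwd.txt
```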
All issues so far have been related to the NVMe disk. However, I have the impression this could also stem from an earlier failure situation involving incorrect dynamic inodes.
Next stop: delete the whole repo and start from scratch.
What's the max chunk size restic loads into /tmp? Is there a hard limit? Or is it relative to the max file size in the backup?
--
Doubled /tmp and monitored it during the backup run; it never came near its maximum size. However, again, I ran into the same error, same NVMe, different file. This time at about 50% of the backup completed.
Sounds somewhat similar: https://github.com/restic/restic/issues/5279
Deleted the repo and initialized a new one; the first full backup run is in progress. ETA 60-70 hours for 2.75 TB, so it will take some time if no error occurs.
> on a different source file
A good reason to run an mprime torture test to make sure your CPU and RAM aren't broken.
I did.
At the moment nothing points towards hardware. There are weak indications towards the software; there is a clear correlation with switching from 0.14 to 0.18. However, correlation does not mean root cause. Still, it could be an incompatibility in restic between versions 0.14 and 0.18. Also, the inode handling seems to play a role in this. So maybe it's a more complex error situation.
A completely new backup on a new repo is running at the moment, 38% done and no errors so far.
Recap
- I deleted the old repo which was created with 0.14
- I set up a new repo with 0.18
- I did a first full backup run
- I did not use ignore-inode
- I did not back up via mergerfs but directly from the disks via their GUIDs
Result
- The run completed without errors
- I did a couple of reruns without any data changes on the source drives, and the runs completed fast and without error.
Conclusion
- No root cause found so far
- No indication of a hardware problem
- Correlation with update from 0.14 to 0.18
- My current interpretation: the repo created with 0.14 either had an error not detected by 0.14 but detected by 0.18, or 0.18 isn't fully compatible with 0.14. Since I have a couple more, but far smaller, repos created with 0.14 and now running with 0.18 that work without problems, repo and/or file size might play a role here as well.
The `Corrupted blobs are either caused by hardware issues or software bugs.` error CANNOT in any way be affected by the state of the existing repository. This error message didn't exist in older restic versions (that's the correlation with the version upgrade). What restic does is compress+encrypt a data chunk in memory and then decrypt+decompress it afterwards. This whole operation happens in memory, that is, it cannot be directly affected by the filesystem used.
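To make that round trip concrete, here is a rough shell analogy (this is not restic's actual code, just an illustration using gzip and openssl): hash the plaintext, compress and encrypt it, immediately decrypt and decompress it, and hash the result again. A mismatch at this point means the data was mangled in memory, independent of the backend or the source filesystem.

```bash
chunk=./testchunk.bin   # placeholder: any test file standing in for a data chunk

h1=$(sha256sum < "$chunk" | cut -d' ' -f1)

# Compress + encrypt, then immediately decrypt + decompress, entirely in a pipe
h2=$(gzip -c < "$chunk" \
      | openssl enc -aes-256-ctr -pbkdf2 -pass pass:demo \
      | openssl enc -d -aes-256-ctr -pbkdf2 -pass pass:demo \
      | gunzip -c \
      | sha256sum | cut -d' ' -f1)

[ "$h1" = "$h2" ] && echo "hashes match" || echo "hash mismatch"
```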
As the error seems to randomly occur and disappear, this usually indicates some (minimal) hardware issue. The filesystem and the options passed to restic play a role here insofar as they change the load on the hardware. Problems may occur during low or high load, or only when switching between them.
Please run `check --read-data` to verify the repository integrity (warning: this will download the whole repository).
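Applied to the repository above, that would be roughly:

```bash
# Full read check - downloads and verifies every pack file in the repository
restic -r sftp:user.server.com: --password-file /pwd.txt check --read-data

# Alternative if a single full download is impractical: check one slice at a
# time, e.g. the first of five slices
restic -r sftp:user.server.com: --password-file /pwd.txt check --read-data-subset=1/5
```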