`borg extract` should retry failed files (up to X times) instead of failing the entire job
When a file fails checksum verification during borg extract, many times this is simply due to a network error and can be fixed by retrying the file. When I'm restoring 25 TB of files, I don't want one random failure to force me to restart the entire job, especially since there is no deduplication mechanism (à la rsync) in borg extract.
To me, this is a critical issue, because borg is unable to perform the primary job I expect of it: restoring my files from backup. Instead I have to do a tedious workaround of getting a list of all files that need to be restored and extracting them one directory at a time, when there are hundreds of directories.
Please post the traceback and the borg version(s), so we can exactly see what failed.
Also, I don't quite understand how you get from "when a file fails checksum verification MANY times" to "can be fixed by retrying the file".
To me that rather sounds like: if it fails many times, it likely fails always and retrying is pointless.
Also, it is not really clear from the given facts that this is a network error; it could also be corruption in the repository due to a filesystem or media error.
But of course I see that failing the whole extraction isn't good, especially if the extraction volume is huge. To see how / where exactly it fails, we need the traceback.
About "no deduplication mechanism": this isn't easy for the general case, but borg2 / master branch now at least skips re-extracting files if a local file is present and the metadata matches perfectly (size and mtime, IIRC), which is already quite good for continuing an interrupted restore run.
If it is precisely one or a few files that have issues, I guess you could also try to exclude them (try that first with a smaller archive, so you'll see how to get a successful exclusion).
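For example, a rough sketch with placeholder names (ARCHIVE and the paths would need to be replaced with the real ones), skipping a known-bad file during extraction:

borg extract --exclude 'path/to/restore/known-bad-file' ::ARCHIVE path/to/restore
# or, for several bad files, list them in a file and use --exclude-from:
borg extract --exclude-from bad-files.txt ::ARCHIVE path/to/restore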
Also, find out and fix the root cause of that issue.
I suppose I phrased that poorly. The same file is not failing each time. It seems to randomly fail after some large amount of data (about 100GB on average), but it's a different file each time. Any given file will succeed if retried.
This is on borg 1.4.1, both the client and server, on Linux. Neither machine is overclocked in any way, though I suppose that doesn't rule out some sort of memory issue causing intermittent read/write corruption (both disk arrays are on mdraid RAID5, and show no issues on mdstat or smartctl). I will post a full traceback next time I encounter it. I've been having better success doing the restore in smaller batches (one directory at a time), though it's tedious.
Sure enough, memory issues do appear to be responsible for the checksum failures. That being said, I do still think a feature such as this would be very helpful for large restore jobs.
Edit: I'll also note that I found a good workaround using existing features, which is to use borg mount to mount the archive, then use rsync for resumable restoring.
Edit 2: That didn't work. It turns out there are files in the repo that are actually corrupted, and when rsync reaches these files, the borg mount disconnects. I suppose the next step is attempting borg check --repair and making note of which files were corrupted so I can manually restore them. 🤞 This whole process is quite unfriendly to the user, though. In my opinion it would be much less frustrating if borg extract would first retry a corrupted file (maybe 3 times by default), and if it keeps failing, skip it, continue on, and report all failures at the end of the extract.
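In the meantime, a rough shell sketch of that retry-then-skip behavior (untested; the archive name and directory list are placeholders) could look like this, extracting one directory at a time:

for dir in data/dir1 data/dir2; do                        # placeholder directory list
    ok=0
    for attempt in 1 2 3; do                              # retry each directory up to 3 times
        if borg extract ::ARCHIVE "$dir"; then ok=1; break; fi
    done
    [ "$ok" -eq 1 ] || echo "$dir" >> failed-paths.txt    # collect failures to report at the end
done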
I have never been successful in restoring a very large file (~200 GB). I tried both "borg extract" and "borg mount" + rsync.
"borg extract" gives this backtrace (at random times) and starts from beginning on every re-attempt. This error occurs only during real extraction and not during dry-run (with -n option). So, I suspected some hard drive issue initially, but there is no kernel disconnect error or failure seen in the mounted drive, and drive continues to work normally. Can this be a network checksum error as describe in this issue?
borg extract --progress ::borgbackup-2025-08-14T02:30:47 source/data/Documents/Common/InstantUpload/sdd6.bkp
Local Exception
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5391, in main
exit_code = archiver.run(args)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5309, in run
rc = func(args)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 191, in wrapper
return method(self, args, repository=repository, **kwargs)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 206, in wrapper
return method(self, args, repository=repository, manifest=manifest, key=key, archive=archive, **kwargs)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 927, in do_extract
archive.extract_item(item, stdout=stdout, sparse=sparse, hardlink_masters=hardlink_masters,
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stripped_components=strip_components, original_path=orig_path, pi=pi)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archive.py", line 816, in extract_item
for data in self.pipeline.fetch_many(ids, is_preloaded=True):
~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archive.py", line 323, in fetch_many
yield self.key.decrypt(id_, data)
~~~~~~~~~~~~~~~~^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/crypto/key.py", line 461, in decrypt
payload = self.cipher.decrypt(data)
File "src/borg/crypto/low_level.pyx", line 289, in borg.crypto.low_level.AES256_CTR_BASE.decrypt
File "src/borg/crypto/low_level.pyx", line 371, in borg.crypto.low_level.AES256_CTR_HMAC_SHA256.mac_verify
borg.crypto.low_level.IntegrityError: MAC Authentication failed
Platform: Linux 200a74d01341 6.8.0-60-generic #63-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 15 19:04:15 UTC 2025 x86_64
Linux: Unknown Linux
Borg: 1.4.0 Python: CPython 3.13.3 msgpack: 1.0.3 fuse: pyfuse3 3.4.0 [pyfuse3,llfuse]
PID: 186 CWD: /
sys.argv: ['/usr/bin/borg', 'extract', '--progress', '::borgbackup-2025-08-14T02:30:47', 'source/data/Documents/Common/InstantUpload/sdd6.bkp']
SSH_ORIGINAL_COMMAND: None
With "borg mount" followed by "rsync -a", borg mount point "disappears" at some random point, and rsync starts filling junk data further. I identify this point when the rsync suddenly becomes lot faster than during normal transfer (around 150MBps vs 6MBps during normal transfer). File gets corrupted, and rsync cannot continue with the "--partial --append-verify", after remounting the backup again.
First attempt:
Documents/Common/InstantUpload/sdd6.bkp
204,899,917,824 100% 107.84MB/s 0:30:12 (xfr#509, ir-chk=1548/44622)
rsync: [sender] read errors mapping "/mnt/source/data/Documents/Common/InstantUpload/sdd6.bkp": Transport endpoint is not connected (107)
Further attempts involve truncating the junk data and re-attempting the process.
rsync -a --progress --inplace --partial --append /mnt/source/data/Documents/Common/InstantUpload/sdd6.bkp /source/data/Documents/Common/InstantUpload/sdd6.bkp
sending incremental file list
sdd6.bkp
54,751,258,249 26% 5.65MB/s 7:12:48
54,922,110,601 26% 5.43MB/s 7:29:44
55,191,725,705 26% 5.04MB/s 8:03:03
55,370,147,465 27% 5.93MB/s 6:50:34
55,696,352,905 27% 6.03MB/s 6:43:02
55,888,111,241 27% 13.43MB/s 3:00:31
56,063,649,417 27% 5.65MB/s 7:08:49
56,257,209,993 27% 6.88MB/s 5:51:33
56,588,101,257 27% 5.75MB/s 6:59:46
57,155,085,961 27% 6.15MB/s 6:30:49
58,604,086,921 28% 5.93MB/s 6:41:31
59,165,304,457 28% 6.00MB/s 6:35:16
59,872,503,433 29% 5.82MB/s 6:45:37
60,917,802,633 29% 5.71MB/s 6:50:41
62,069,794,441 30% 5.84MB/s 6:37:47
63,058,863,753 30% 5.76MB/s 6:40:29
98,050,926,217 47% 177.07MB/s 0:09:49 ^C
Truncated the file to 63,058,863,753 bytes, then "borg mount" and retry (not sure if everything before 63,058,863,753 is good).
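The truncation of the partially restored destination file can be done with coreutils truncate, for example:

truncate -s 63058863753 /source/data/Documents/Common/InstantUpload/sdd6.bkp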
rsync -a --progress --inplace --partial --append /mnt/source/data/Documents/Common/InstantUpload/sdd6.bkp /source/data/Documents/Common/InstantUpload/sdd6.bkp
sending incremental file list
sdd6.bkp
64,650,307,209 31% 5.66MB/s 6:43:02
64,926,770,825 31% 6.00MB/s 6:19:29
65,940,350,601 32% 5.92MB/s 6:22:02
67,771,786,889 33% 4.92MB/s 7:33:36
68,296,238,729 33% 5.84MB/s 6:20:46
68,574,504,585 33% 5.85MB/s 6:19:35
69,267,285,641 33% 6.18MB/s 5:57:08
70,941,927,049 34% 6.32MB/s 5:45:13
71,722,657,417 35% 4.54MB/s 7:57:49
73,971,492,489 36% 5.86MB/s 6:03:23
74,813,499,017 36% 5.25MB/s 6:43:35
75,544,848,009 36% 5.20MB/s 6:44:46
76,099,937,929 37% 5.57MB/s 6:16:20
77,396,829,833 37% 5.60MB/s 6:10:52
79,414,978,185 38% 6.45MB/s 5:16:34
81,001,309,833 39% 5.81MB/s 5:47:06
82,984,134,281 40% 5.87MB/s 5:38:05
89,410,201,225 43% 6.33MB/s 4:56:45
90,677,536,393 44% 6.38MB/s 4:51:28
92,801,656,457 45% 5.11MB/s 5:57:07
93,183,731,337 45% 5.58MB/s 5:25:46
177,093,535,369 86% 193.84MB/s 0:02:20
So even manually, restoring a large file is becoming an impossible task.
@a9183756-gh This means that the data is corrupted:
borg.crypto.low_level.IntegrityError: MAC Authentication failed
If this happens at different, random places within that big file, a hardware issue is likely the root cause (some hardware between disk and CPU corrupting bits). If it always happened at the same place, it could also be that the on-disk data is already corrupt.
So: first check your hardware and make sure it works correctly (especially run memtest86+ to check your RAM). After that, run borg check [--repair] on the repository.
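For example (assuming BORG_REPO is set, as in the commands above):

# read-only consistency check of the repository and its archives
borg check -v ::
# only after the hardware checks out: attempt the repair (this modifies the repository)
borg check --repair -v ::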