
`borg extract` should retry failed files (up to X times) instead of failing the entire job

Open shssoichiro opened this issue 6 months ago • 5 comments

When a file fails checksum verification during borg extract, many times this is simply due to a network error and can be fixed by retrying the file. When I'm restoring 25 TB of files, I don't want one random failure to force me to restart the entire job, especially since there is no deduplication mechanism (à la rsync) in borg extract.

To me, this is a critical issue, because borg is unable to perform the primary job that I expect of it: restoring my files from backup. Instead I have to do a tedious workaround of getting a list of all files that need to be restored, and doing them one directory at a time, when there are hundreds of directories.

shssoichiro avatar Jun 15 '25 20:06 shssoichiro

Please post the traceback and the borg version(s), so we can exactly see what failed.

Also, I don't quite understand how you get from "when a file fails checksum verification MANY times" to "can be fixed by retrying the file".

To me that rather sounds like: if it fails many times, it likely always fails and retrying is pointless.

Also, it is not really clear from the given facts that this is a network error; it could also be corruption in the repository due to a filesystem or media error.

But of course I see that failing the whole extraction isn't good, especially if the extraction volume is huge. To see how / where exactly it fails, we need the traceback.

About "no deduplication mechanism": this isn't easy for the general case, but borg2 / master branch now at least skips re-extracting files if a local file is present and the metadata matches perfectly (size and mtime, IIRC), which is already quite good for continuing an interrupted restore run.

If it is precisely one or a few files that have issues, I guess you could also try to exclude them (try that first with a smaller archive, so you can see how to get a successful exclusion).

Also find out and fix the root cause for that issue.

ThomasWaldmann avatar Jun 16 '25 15:06 ThomasWaldmann

I suppose I phrased that poorly. The same file is not failing each time. It seems to randomly fail after some large amount of data (about 100GB on average), but it's a different file each time. Any given file will succeed if retried.

This is on borg 1.4.1, both the client and server, on Linux. Neither machine is overclocked in any way, though I suppose that doesn't rule out some sort of memory issue causing intermittent read/write corruption (both disk arrays are on mdraid RAID5, and show no issues on mdstat or smartctl). I will post a full traceback next time I encounter it. I've been having better success doing the restore in smaller batches (one directory at a time), though it's tedious.

shssoichiro avatar Jun 16 '25 15:06 shssoichiro

Sure enough, memory issues do appear to be responsible for the checksum failures. That being said, I do still think a feature such as this would be very helpful for large restore jobs.

Edit: I'll also note that I found a good workaround using existing features, which is to use borg mount to mount the archive, then use rsync for resumable restoring.

Edit 2: That didn't work. It turns out there are files in the repo that are actually corrupted, and when rsync reaches these files, borg's mount disconnects. I suppose the next step is attempting a borg check --repair and making a note of which files were corrupted so I can manually restore them. 🤞 This whole process is quite unfriendly to the user, though. In my opinion, it would be much less frustrating if borg extract would first retry a corrupted file (maybe 3 times by default), and if it continues failing, skip it, continue on, and report all failures at the end of the extract.
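
In the meantime, something along these lines is roughly the behaviour I have in mind, sketched as an (untested) per-file wrapper around borg extract. Here file_list.txt is a hypothetical file with one archive path per line, ::myarchive is a placeholder archive name, and BORG_REPO is assumed to be exported:

```python
#!/usr/bin/env python3
"""Untested sketch: extract each path individually with retries,
skip paths that keep failing, and report all failures at the end."""
import subprocess
import sys

ARCHIVE = "::myarchive"   # placeholder; relies on BORG_REPO being set
MAX_RETRIES = 3

with open("file_list.txt") as fh:
    paths = [line.strip() for line in fh if line.strip()]

failed = []
for path in paths:
    for attempt in range(1, MAX_RETRIES + 1):
        result = subprocess.run(["borg", "extract", ARCHIVE, path])
        if result.returncode == 0:
            break
        print(f"attempt {attempt}/{MAX_RETRIES} failed for {path}", file=sys.stderr)
    else:
        failed.append(path)   # gave up on this path after MAX_RETRIES attempts

if failed:
    print("Could not extract:", file=sys.stderr)
    for path in failed:
        print("  " + path, file=sys.stderr)
    sys.exit(1)
```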

shssoichiro avatar Jun 16 '25 22:06 shssoichiro

I have never been successful in restoring a very large file (~200 GB). I tried both "borg extract" and "borg mount" + rsync.

"borg extract" gives this backtrace (at random times) and starts from beginning on every re-attempt. This error occurs only during real extraction and not during dry-run (with -n option). So, I suspected some hard drive issue initially, but there is no kernel disconnect error or failure seen in the mounted drive, and drive continues to work normally. Can this be a network checksum error as describe in this issue?

borg extract --progress ::borgbackup-2025-08-14T02:30:47 source/data/Documents/Common/InstantUpload/sdd6.bkp
Extracting: source/data/Documents/Common/InstantUpload/sdd6.bkp
Local Exception
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5391, in main
    exit_code = archiver.run(args)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5309, in run
    rc = func(args)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 191, in wrapper
    return method(self, args, repository=repository, **kwargs)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 206, in wrapper
    return method(self, args, repository=repository, manifest=manifest, key=key, archive=archive, **kwargs)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 927, in do_extract
    archive.extract_item(item, stdout=stdout, sparse=sparse, hardlink_masters=hardlink_masters,
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                         stripped_components=strip_components, original_path=orig_path, pi=pi)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 816, in extract_item
    for data in self.pipeline.fetch_many(ids, is_preloaded=True):
                ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 323, in fetch_many
    yield self.key.decrypt(id_, data)
          ~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/crypto/key.py", line 461, in decrypt
    payload = self.cipher.decrypt(data)
  File "src/borg/crypto/low_level.pyx", line 289, in borg.crypto.low_level.AES256_CTR_BASE.decrypt
  File "src/borg/crypto/low_level.pyx", line 371, in borg.crypto.low_level.AES256_CTR_HMAC_SHA256.mac_verify
borg.crypto.low_level.IntegrityError: MAC Authentication failed

Platform: Linux 200a74d01341 6.8.0-60-generic #63-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 15 19:04:15 UTC 2025 x86_64
Linux: Unknown Linux  
Borg: 1.4.0  Python: CPython 3.13.3 msgpack: 1.0.3 fuse: pyfuse3 3.4.0 [pyfuse3,llfuse]
PID: 186  CWD: /
sys.argv: ['/usr/bin/borg', 'extract', '--progress', '::borgbackup-2025-08-14T02:30:47', 'source/data/Documents/Common/InstantUpload/sdd6.bkp']
SSH_ORIGINAL_COMMAND: None

With "borg mount" followed by "rsync -a", borg mount point "disappears" at some random point, and rsync starts filling junk data further. I identify this point when the rsync suddenly becomes lot faster than during normal transfer (around 150MBps vs 6MBps during normal transfer). File gets corrupted, and rsync cannot continue with the "--partial --append-verify", after remounting the backup again.

First attempt:

Documents/Common/InstantUpload/sdd6.bkp
204,899,917,824 100%  107.84MB/s    0:30:12 (xfr#509, ir-chk=1548/44622)
rsync: [sender] read errors mapping "/mnt/source/data/Documents/Common/InstantUpload/sdd6.bkp": Transport endpoint is not connected (107)

Further attempts were made by truncating the junk data and re-running the process.

rsync -a --progress --inplace --partial --append /mnt/source/data/Documents/Common/InstantUpload/sdd6.bkp /source/data/Documents/Common/InstantUpload/sdd6.bkp 
sending incremental file list
sdd6.bkp
 54,751,258,249  26%    5.65MB/s    7:12:48  
 54,922,110,601  26%    5.43MB/s    7:29:44  
 55,191,725,705  26%    5.04MB/s    8:03:03  
 55,370,147,465  27%    5.93MB/s    6:50:34  
 55,696,352,905  27%    6.03MB/s    6:43:02  
 55,888,111,241  27%   13.43MB/s    3:00:31  
 56,063,649,417  27%    5.65MB/s    7:08:49  
 56,257,209,993  27%    6.88MB/s    5:51:33  
 56,588,101,257  27%    5.75MB/s    6:59:46  
 57,155,085,961  27%    6.15MB/s    6:30:49  
 58,604,086,921  28%    5.93MB/s    6:41:31  
 59,165,304,457  28%    6.00MB/s    6:35:16  
 59,872,503,433  29%    5.82MB/s    6:45:37  
 60,917,802,633  29%    5.71MB/s    6:50:41  
 62,069,794,441  30%    5.84MB/s    6:37:47  
 63,058,863,753  30%    5.76MB/s    6:40:29  
 98,050,926,217  47%  177.07MB/s    0:09:49  ^C

Truncated the file to 63,058,863,753 bytes, re-ran "borg mount", and retried (not sure if everything before 63,058,863,753 is good).
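
For illustration, truncating the partially restored file back to that offset (the last byte count rsync printed before the transfer rate jumped) can be done with something like:

```python
import os

# Hypothetical example: cut the destination file back to the last offset
# believed to contain real data rather than junk, before retrying rsync.
os.truncate("/source/data/Documents/Common/InstantUpload/sdd6.bkp",
            63_058_863_753)
```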

rsync -a --progress --inplace --partial --append /mnt/source/data/Documents/Common/InstantUpload/sdd6.bkp /source/data/Documents/Common/InstantUpload/sdd6.bkp 
sending incremental file list
sdd6.bkp
 64,650,307,209  31%    5.66MB/s    6:43:02  
 64,926,770,825  31%    6.00MB/s    6:19:29  
 65,940,350,601  32%    5.92MB/s    6:22:02  
 67,771,786,889  33%    4.92MB/s    7:33:36  
 68,296,238,729  33%    5.84MB/s    6:20:46  
 68,574,504,585  33%    5.85MB/s    6:19:35  
 69,267,285,641  33%    6.18MB/s    5:57:08  
 70,941,927,049  34%    6.32MB/s    5:45:13  
 71,722,657,417  35%    4.54MB/s    7:57:49  
 73,971,492,489  36%    5.86MB/s    6:03:23  
 74,813,499,017  36%    5.25MB/s    6:43:35  
 75,544,848,009  36%    5.20MB/s    6:44:46  
 76,099,937,929  37%    5.57MB/s    6:16:20  
 77,396,829,833  37%    5.60MB/s    6:10:52  
 79,414,978,185  38%    6.45MB/s    5:16:34  
 81,001,309,833  39%    5.81MB/s    5:47:06  
 82,984,134,281  40%    5.87MB/s    5:38:05  
 89,410,201,225  43%    6.33MB/s    4:56:45  
 90,677,536,393  44%    6.38MB/s    4:51:28  
 92,801,656,457  45%    5.11MB/s    5:57:07  
 93,183,731,337  45%    5.58MB/s    5:25:46  
177,093,535,369  86%  193.84MB/s    0:02:20

So even manually, restoring a large file is becoming an impossible task.

a9183756-gh avatar Aug 21 '25 06:08 a9183756-gh

@a9183756-gh This means that the data is corrupted:

borg.crypto.low_level.IntegrityError: MAC Authentication failed

If this happens at different, random places within that big file, a hardware issue is likely the root cause (some hardware between disk and CPU corrupting bits). If it always happened at the same place, it could also be that the on-disk data is already corrupt.

So: first check your hardware and make sure it works correctly (especially run memtest86+ to check the RAM). After that, run borg check [--repair] on the repository.

ThomasWaldmann avatar Aug 21 '25 11:08 ThomasWaldmann