borg icon indicating copy to clipboard operation
borg copied to clipboard

corruption of repository nonce during a "borg create" run? (binascii.Error: Non-hexadecimal digit found)

Open mnalis opened this issue 2 years ago • 9 comments

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes. There are related issues https://github.com/borgbackup/borg/issues/6031 and https://github.com/borgbackup/borg/issues/7266, and a FAQ "How important are the nonce files?" suggests a course of action.

Is this a BUG / ISSUE report or a QUESTION?

bug

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borg 1.2.4

Operating system (distribution) and version.

Debian Bookworm GNU/Linux

Hardware / network configuration, and filesystems used.

Dell i5 laptop with 16GB RAM, source NVMEs, with external 3.5TB USB HDD. Ext4 on both the source data and destination borg repository.

How much data is handled by borg?

about 1 TB in external HDD repository.

Full borg commandline that lead to the problem (leave away excludes and passwords)

sudo borg create --stats --progress --noflags --noacls --noxattrs --compression zstd --exclude-caches --exclude-from /etc/exclude.txt /media/ladica/backup/borg::{hostname}-{now}

Describe the problem you're observing.

The backup seems to have started normally, then after some time fails, apparently with corrupted "nonce". There seems to have been one transient HDD read error, but it did not repeat. Unmounting the external HDD and force fsck-ing the system showed only minor off-by-one free block count. The laptop has not been running kernel 6.1.64 with ext4 bug at any time (nor was that external HDD being used on other systems with such kernel). Shorter tests do not indicate that there are memory or CPU issues (nor does regular use of the system exhibits any other problems).

What is interesting is that the borg backup seems to have backed up over 4GB new data (after going over 980K files) before bailing out with that error. The nonce value seems corrupted with some JSON-alike text :

% sudo ls -l /media/ladica/backup/borg/nonce
[sudo] password for mnalis:
-rw------- 1 root root 16 Dec 16 01:23 /media/ladica/backup/borg/nonce
% sudo hd /media/ladica/backup/borg/nonce
00000000  7b 22 65 78 63 6c 75 73  69 76 65 22 3a 20 5b 5b  |{"exclusive": [[|
00000010

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

I have not encountered that specific problem in the past, so likely not easily reproducible. I have stopped operating on the drive in order to not introduce further damage, and collect any more data that might be useful for troubleshooting. Trying to run sudo time borg check --last 2 /media/ladica/backup/borg does fail with binascii.Error: Non-hexadecimal digit found error too. (but have not repaired it).

It is my understanding from reading that FAQ that just removing that /media/ladica/backup/borg/nonce should allow backups to continue. However it seems interesting that the bug manifested in the middle of backup and with such JSON-alike value. The documentation mentions that it takes extra care that smaller hardware errors should not cause such corruption; so one would not think that one-time HDD read error (recoverable or not) should lead to such result?

So:

  • Do you want me to try something else in order to gather more data, or should I proceed with removing nonce and continuing backups?
  • ~/.config/borg/security/*/nonce seem to contain normal hexadecimal values, although they all seem more than a year old, and I've made backups in that repository a few weeks back?! There are several newer directories under .config/borg/security, but the do not contain files named nonce ? Is that to be expected, and will it affect future backups?

Include any warning/errors/backtraces from the system logs

fsck
% sudo fsck /dev/sda2 -C0 -r -f
fsck from util-linux 2.38.1
e2fsck 1.47.0 (5-Feb-2023)
Setting free blocks count to 694077103 (was 694077104)
ts_ladica34: clean, 2802/237633536 files, 256454476/950531579 blocks
/dev/sda2: status 0, rss 5328, real 1.010703, user 0.459906, sys 0.004034

dmesg noting one read error some half minute after mounting the external HDD.
[180509.470199] usb 4-1: new SuperSpeed USB device number 3 using xhci_hcd
[180509.491375] usb 4-1: New USB device found, idVendor=1058, idProduct=2621, bcdDevice=10.34
[180509.491384] usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[180509.491388] usb 4-1: Product: Elements 2621
[180509.491391] usb 4-1: Manufacturer: Western Digital
[180509.491394] usb 4-1: SerialNumber: 575836324441315045544836
[180509.493221] usb-storage 4-1:1.0: USB Mass Storage device detected
[180509.493585] scsi host0: usb-storage 4-1:1.0
[180510.503374] scsi 0:0:0:0: Direct-Access     WD       Elements 2621    1034 PQ: 0 ANSI: 6
[180510.503784] sd 0:0:0:0: Attached scsi generic sg0 type 0
[180510.505170] sd 0:0:0:0: [sda] Spinning up disk...
[180511.526112] .........ready
[180519.719108] sd 0:0:0:0: [sda] Very big device. Trying to use READ CAPACITY(16).
[180519.719346] sd 0:0:0:0: [sda] 7813969920 512-byte logical blocks: (4.00 TB/3.64 TiB)
[180519.719349] sd 0:0:0:0: [sda] 4096-byte physical blocks
[180519.719700] sd 0:0:0:0: [sda] Write Protect is off
[180519.719708] sd 0:0:0:0: [sda] Mode Sense: 47 00 10 08
[180519.720058] sd 0:0:0:0: [sda] No Caching mode page found
[180519.720067] sd 0:0:0:0: [sda] Assuming drive cache: write through
[180519.919029]  sda: sda1 sda2
[180519.919625] sd 0:0:0:0: [sda] Attached SCSI disk
[180521.459664] EXT4-fs (sda2): mounted filesystem with ordered data mode. Quota mode: none.
[180553.197059] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[180553.197065] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
[180553.197066] sd 0:0:0:0: [sda] tag#0 Add. Sense: Information unit iuCRC error detected
[180553.197068] sd 0:0:0:0: [sda] tag#0 CDB: Read(16) 88 00 00 00 00 01 76 00 08 00 00 00 00 08 00 00
[180553.197069] I/O error, dev sda, sector 6274680832 op 0x0:(READ) flags 0x83700 phys_seg 1 prio class 2

borg error:

% sudo borg create --stats --progress --noflags --noacls --noxattrs --compression zstd --exclude-caches --exclude-from /etc/exclude.txt  /media/ladica/backup/borg::{hostname}-{now} /
Enter passphrase for key /media/ladica/backup/borg:
Local Exception.32 GB C 4.26 GB D 980645 N home/mnalis/Videos/s/200 Incredible Moments Caught On Camera ! Best Of The Month [lFTldpIEnV0].webm
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5213, in main
    exit_code = archiver.run(args)
                ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5144, in run
    return set_ec(func(args))
                  ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 183, in wrapper
    return method(self, args, repository=repository, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 675, in do_create
    create_inner(archive, cache, fso)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 608, in create_inner
    self._rec_walk(path=path, parent_fd=parent_fd, name=name,
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 823, in _rec_walk
    self._rec_walk(
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 823, in _rec_walk
    self._rec_walk(
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 823, in _rec_walk
    self._rec_walk(
  [Previous line repeated 2 more times]
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 786, in _rec_walk
    status = self._process_any(path=path, parent_fd=parent_fd, name=name, st=st, fso=fso, cache=cache,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 688, in _process_any
    return fso.process_file(path=path, parent_fd=parent_fd, name=name, st=st, cache=cache)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 1451, in process_file
    self.process_file_chunks(item, cache, self.stats, self.show_progress, backup_io_iter(self.chunker.chunkify(None, fd)))
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 1273, in process_file_chunks
    item.chunks.append(chunk_processor(chunk))
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 1261, in chunk_processor
    chunk_entry = cache.add_chunk(chunk_id, data, stats, wait=False)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/cache.py", line 949, in add_chunk
    data = self.key.encrypt(chunk)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/crypto/key.py", line 370, in encrypt
    next_iv = self.nonce_manager.ensure_reservation(self.cipher.next_iv(),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/crypto/nonces.py", line 79, in ensure_reservation
    repo_free_nonce = self.get_repo_free_nonce()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/crypto/nonces.py", line 38, in get_repo_free_nonce
    return self.repository.get_free_nonce()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 352, in get_free_nonce
    return int.from_bytes(unhexlify(fd.read()), byteorder='big')
                          ^^^^^^^^^^^^^^^^^^^^
binascii.Error: Non-hexadecimal digit found

Platform: Linux leia 6.1.0-15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09) x86_64
Linux: Unknown Linux
Borg: 1.2.4  Python: CPython 3.11.2 msgpack: 1.0.3 fuse: llfuse 1.4.1 [pyfuse3,llfuse]
PID: 31422  CWD: /home/mnalis
sys.argv: ['/usr/bin/borg', 'create', '--stats', '--progress', '--noflags', '--noacls', '--noxattrs', '--compression', 'zstd', '--exclude-caches', '--exclude-from', '/etc/exclude.txt', '/media/ladica/backup/borg::{hostname}-{now}', '/']
SSH_ORIGINAL_COMMAND: None

%

mnalis avatar Dec 17 '23 02:12 mnalis

Hmm, that does not look like a borg error, but rather that the repo disk fs or disk hw was confused somehow.

Borg never writes such json content to the nonce file - that content rather looks like from a lock file.

You already ran a fsck -f, so maybe also do a smartctl -t long on that disk and afterwards check smart status again using smartctl -a (you can post the output here, if unsure).

Other than that, yes I'ld just remove the defect nonce file from the repo (it's only making troubles as it is now) and retry.

ThomasWaldmann avatar Dec 17 '23 18:12 ThomasWaldmann

Thanks; long self-test seems to finish without any problems, and SMART results do not look problematic to me:

% sudo smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-15-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Elements / My Passport (USB, AF)
Device Model:     WDC WD40NDZW-11BCSS0
Serial Number:    WD-WX62DA1PETH6
LU WWN Device Id: 5 0014ee 26a079696
Firmware Version: 01.01A01
User Capacity:    4.000.753.475.584 bytes [4,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    4800 rpm
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Dec 18 03:44:27 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x04) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (10320) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  28) minutes.
SCT capabilities:              (0x30b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       2058
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       266
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       134
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       241
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       200
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1268
194 Temperature_Celsius     0x0022   119   093   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       133         -
# 2  Short offline       Completed without error       00%       128         -
# 3  Extended offline    Completed without error       00%       128         -

Selective Self-tests/Logging not supported

%

I've removed the nonce and on first look it seems new backups seem to work just fine, as is restoring new/old data.

That got me wondering how just that nonce file could get corrupted, with the rest of the disk and repository apparently in good condition? I've looked at commit_nonce_reservation which I think is responsible for saving nonce file (note that I'm completely new to borg codebase and not an python expert; my background is more in C) :

https://github.com/borgbackup/borg/blob/4c131059bc9e90753160c07bf26165278aeab69a/src/borg/repository.py#L365-L369

and it seems that it relies on SaveFile to make sure that the data is safely on disk before continuing there.

However if I'm reading that correctly, SaveFile seems to only do close(2), which does not guarantee that data is safely on disk? i.e.

https://github.com/borgbackup/borg/blob/1432f547fe4fe13386b245dcc0d9c2a0880ddfe6/src/borg/platform/base.py#L239-L244

Would it make sense to either:

  • explicitly fsync(2) file data before close(2) in SaveFile __exit__?
  • explicitly fsync(2) file data in commit_nonce_reservation after write(2) (if above would be performance killer in other cases where it is unneeded - nonce is not being written all that frequently if I understand correctly, so it should be OK to do it just there?)
  • I notice that there is also class SyncFile which perhaps might be used for nonce file (the function descriptions do not seem to explain when one is supposed to use one or the other)?

mnalis avatar Dec 18 '23 14:12 mnalis

Need to have a deeper look there.

But I guess a missing fsync does not explain that it showed the data from a completely different file.

ThomasWaldmann avatar Dec 18 '23 16:12 ThomasWaldmann

Need to have a deeper look there.

Thanks, I've split that fsync(2) suggestion into separate issue https://github.com/borgbackup/borg/issues/7969

But I guess a missing fsync does not explain that it showed the data from a completely different file.

Yeah, not easily explainable. There could be interactions with write reorders and unclean shutdown that lead to that, but that reason for this particular issue has suddenly become much less probable - as the issue just reproduced with exactly the same bug.

Timeline:

  • after initial problem reported here, I did fsck -f and smartctl -t long to make sure no physical/logical issues on that HDD. -Then, I've removed nonce file and successfully run new borg create, which didn't complain and seemed to complete just fine. (didn't keep logs of it, though, as I didn't expect this to repeat).
  • I've done (many hours long) borg check -v -p --repair /media/ladica/backup/borg and there seems to have been one file with Previously missing file chunk is still missing error. Check seemed to complete OK:
% (date; sudo time borg check -v -p --repair /media/ladica/backup/borg ; date) | tee ~/tmp/borg-repair-20231218b.txt
Mon Dec 18 15:59:53 CET 2023
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 2788
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Enter passphrase for key /media/ladica/backup/borg:
Analyzing archive leia-2022-11-28T23:40:49 (1/100)
Analyzing archive leia-2022-11-29T01:34:42 (2/100)
Analyzing archive skywalker-2022-11-29T14:19:21 (3/100)
Analyzing archive leia-2022-11-29T18:05:52 (4/100)
Analyzing archive leia-2022-11-29T18:30:56 (5/100)
Analyzing archive leia-2022-12-08T13:10:50 (6/100)
Analyzing archive leia-2022-12-09T01:14:12 (7/100)
Analyzing archive leia-2022-12-09T01:28:19 (8/100)
Analyzing archive leia-2022-12-09T19:56:15 (9/100)
Analyzing archive leia-2022-12-09T20:12:02 (10/100)
Analyzing archive leia-2022-12-27T14:40:11 (11/100)
Analyzing archive leia-2023-01-24T22:59:18.checkpoint (12/100)
Analyzing archive leia-2023-01-24T23:05:06 (13/100)
Analyzing archive leia-2023-01-24T23:16:09 (14/100)
Analyzing archive leia-2023-01-25T03:07:01 (15/100)
Analyzing archive leia-2023-01-25T03:12:02 (16/100)
Analyzing archive leia-2023-01-26T14:27:44 (17/100)
Analyzing archive skywalker-2023-01-26T14:31:55 (18/100)
Analyzing archive skywalker-2023-01-26T14:57:18 (19/100)
Analyzing archive leia-2023-01-30T23:12:06 (20/100)
Analyzing archive torres-20230130-backup-2023-01-30T23:39:12 (21/100)
Analyzing archive leia-2023-01-31T05:24:17 (22/100)
Analyzing archive leia-2023-02-06T13:35:42 (23/100)
Analyzing archive leia-2023-02-13T01:10:18 (24/100)
Analyzing archive leia-2023-02-16T17:58:06 (25/100)
Analyzing archive leia-2023-02-17T06:28:50 (26/100)
Analyzing archive leia-2023-03-20T12:06:09 (27/100)
Analyzing archive HuaweiP30-2023-03-20T12:58:27.checkpoint (28/100)
Analyzing archive HuaweiP30-2023-03-20T13:03:13.checkpoint (29/100)
Analyzing archive HuaweiP30-2023-03-20T13:11:30 (30/100)
Analyzing archive torres-owncloud-2023-03-20T14:28:39 (31/100)
Analyzing archive torres-owncloud-2023-03-20T14:33:21.checkpoint (32/100)
Analyzing archive torres-owncloud-2023-03-20T19:12:30.checkpoint (33/100)
torres-owncloud-2023-03-20T19:12:30.checkpoint: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-owncloud-2023-03-20T19:12:30.checkpoint: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive torres-owncloud-2023-03-20T23:07:53.checkpoint (34/100)
Analyzing archive torres-owncloud-2023-03-21T02:21:30 (35/100)
torres-owncloud-2023-03-21T02:21:30: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-owncloud-2023-03-21T02:21:30: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive torres-owncloud-2023-03-21T13:55:34 (36/100)
Analyzing archive torres-owncloud-2023-03-21T13:57:03 (37/100)
torres-owncloud-2023-03-21T13:57:03: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-owncloud-2023-03-21T13:57:03: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive torres-owncloud-2023-03-22T01:12:11.checkpoint (38/100)
Analyzing archive torres-owncloud-2023-03-22T01:22:32 (39/100)
torres-owncloud-2023-03-22T01:22:32: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-owncloud-2023-03-22T01:22:32: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive torres-2023-03-22T01:31:44 (40/100)
torres-2023-03-22T01:31:44: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-2023-03-22T01:31:44: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive skywalker-2023-03-23T13:46:42 (41/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-23T22:13:30 (42/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-23T22:14:34 (43/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-23T23:03:59 (44/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-23T23:07:45 (45/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-24T00:32:48 (46/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-24T01:54:45.checkpoint (47/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-24T01:57:19 (48/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-24T02:04:04 (49/100)
Analyzing archive leia-2023-03-24T03:05:55 (50/100)
Analyzing archive leia-2023-04-04T16:54:39 (51/100)
Analyzing archive leia-2023-04-18T04:12:09 (52/100)
Analyzing archive leia-2023-05-26T15:15:48 (53/100)
Analyzing archive HuaweiP30-2023-05-26T15:25:10.checkpoint (54/100)
Analyzing archive HuaweiP30-2023-05-26T15:28:22 (55/100)
Analyzing archive HuaweiP30-2023-05-26T16:18:30 (56/100)
Analyzing archive skywalker-2023-05-26T16:37:13 (57/100)
Analyzing archive leia-2023-06-24T16:33:16 (58/100)
Analyzing archive HuaweiP30-2023-06-24T16:44:22.checkpoint (59/100)
Analyzing archive HuaweiP30-2023-06-24T17:13:04 (60/100)
Analyzing archive HuaweiP30-2023-06-24T17:14:25 (61/100)
Analyzing archive HuaweiP30-2023-06-24T18:28:13 (62/100)
Analyzing archive leia-2023-08-27T20:03:19 (63/100)
Analyzing archive HuaweiP30-2023-08-27T20:16:53.checkpoint (64/100)
Analyzing archive HuaweiP30-2023-08-27T23:16:10.checkpoint (65/100)
Analyzing archive HuaweiP30-2023-08-28T15:53:26 (66/100)
Analyzing archive HuaweiP30-2023-08-28T16:52:36 (67/100)
Analyzing archive HuaweiP30-2023-08-28T17:07:45 (68/100)
Analyzing archive HuaweiP30-2023-08-28T17:15:06 (69/100)
Analyzing archive skywalker-2023-08-28T17:57:29 (70/100)
Analyzing archive leia-2023-09-25T17:15:40 (71/100)
Analyzing archive skywalker-2023-10-01T11:19:00 (72/100)
Analyzing archive ivana-stari-acer-laptop-2023-10-01T15:39:47 (73/100)
Analyzing archive HuaweiP30-2023-10-01T23:54:59 (74/100)
Analyzing archive HuaweiP30-2023-10-02T02:50:38 (75/100)
Analyzing archive leia-2023-10-02T04:20:24.checkpoint (76/100)
Analyzing archive leia-2023-10-02T16:06:55 (77/100)
Analyzing archive Xiaomi_SD64-2023-10-02T16:29:54.checkpoint (78/100)
Analyzing archive XiaomiRedmiNote6Pro_SD64-2023-10-02T16:31:02 (79/100)
Analyzing archive XiaomiRedmiNote6Pro_SD64-2023-10-02T16:31:24 (80/100)
Analyzing archive leia-2023-10-02T18:08:01 (81/100)
Analyzing archive XiaomiRedmiNote6Pro_SD64-2023-10-02T18:21:07 (82/100)
Analyzing archive XiaomiRedmiNote6Pro_internal1-2023-10-02T23:19:39 (83/100)
Analyzing archive XiaomiRedmiNote6Pro_internal2-2023-10-02T23:24:03 (84/100)
Analyzing archive leia-2023-10-08T01:36:16 (85/100)
Analyzing archive leia-2023-10-27T11:29:29 (86/100)
Analyzing archive HuaweiP30-2023-10-27T11:43:05 (87/100)
Analyzing archive HuaweiP30-2023-10-27T12:54:38 (88/100)
Analyzing archive HuaweiP30-2023-10-27T13:28:21 (89/100)
Analyzing archive HuaweiP30-2023-10-27T13:32:13 (90/100)
Analyzing archive leia-2023-11-14T13:03:32 (91/100)
Analyzing archive leia-2023-12-02T08:18:10 (92/100)
Analyzing archive leia-2023-12-02T08:49:52 (93/100)
Analyzing archive leia-2023-12-02T09:13:39 (94/100)
Analyzing archive leia-2023-12-05T21:58:52 (95/100)
Analyzing archive leia-2023-12-11T21:31:49 (96/100)
Analyzing archive leia-2023-12-11T22:09:33 (97/100)
Analyzing archive leia-2023-12-13T22:57:25 (98/100)
Analyzing archive leia-2023-12-17T21:37:45 (99/100)
Analyzing archive leia-2023-12-17T21:50:01 (100/100)
Deleting 0 orphaned and 194 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, no problems found.
1171.22user 657.82system 3:14:47elapsed 15%CPU (0avgtext+0avgdata 397012maxresident)k
1932064520inputs+487960outputs (60major+704024minor)pagefaults 0swaps
Mon Dec 18 19:14:41 CET 2023
  • just to be sure, I followed it with another check, this time with verify-data borg check -v -p --verify-data /media/ladica/backup/borg. It also worked for many hours, and then bailed out with same nonce corruption -- even the content looks like same JSON as before.
% (date; sudo time borg check -v -p --verify-data /media/ladica/backup/borg ; date) | tee -a ~/tmp/borg-repair-20231218c.txt
[sudo] password for mnalis: Mon Dec 18 19:20:38 CET 2023

Starting repository check
finished segment check at segment 2792
Starting repository index check
Index object count match.
Finished full repository check, no problems found.
Starting archive consistency check...
Enter passphrase for key /media/ladica/backup/borg:
Starting cryptographic data integrity verification...
Finished cryptographic data integrity verification, verified 2258954 chunks with 0 integrity errors.
Analyzing archive leia-2022-11-28T23:40:49 (1/100)
Local Exception
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5213, in main
    exit_code = archiver.run(args)
                ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5144, in run
    return set_ec(func(args))
                  ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 183, in wrapper
    return method(self, args, repository=repository, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 347, in do_check
    if not args.repo_only and not ArchiveChecker().check(
                                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 1647, in check
    self.rebuild_refcounts(archive=archive, first=first, last=last, sort_by=sort_by, glob=glob)
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 2059, in rebuild_refcounts
    items_buffer.add(item)
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 357, in add
    self.flush()
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 385, in flush
    self.chunks.append(self.write_chunk(chunk))
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 1846, in add_callback
    cdata = self.key.encrypt(chunk)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/crypto/key.py", line 370, in encrypt
    next_iv = self.nonce_manager.ensure_reservation(self.cipher.next_iv(),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/crypto/nonces.py", line 79, in ensure_reservation
    repo_free_nonce = self.get_repo_free_nonce()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/crypto/nonces.py", line 38, in get_repo_free_nonce
    return self.repository.get_free_nonce()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 352, in get_free_nonce
    return int.from_bytes(unhexlify(fd.read()), byteorder='big')
                          ^^^^^^^^^^^^^^^^^^^^
binascii.Error: Non-hexadecimal digit found
Platform: Linux leia 6.1.0-15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09) x86_64
Linux: Unknown Linux
Borg: 1.2.4  Python: CPython 3.11.2 msgpack: 1.0.3 fuse: llfuse 1.4.1 [pyfuse3,llfuse]
PID: 32758  CWD: /home/mnalis
sys.argv: ['/usr/bin/borg', 'check', '-v', '-p', '--verify-data', '/media/ladica/backup/borg']
SSH_ORIGINAL_COMMAND: None

Command exited with non-zero status 2
3688.13user 1156.81system 7:20:27elapsed 18%CPU (0avgtext+0avgdata 397468maxresident)k
3845164672inputs+48outputs (35major+148304minor)pagefaults 0swaps
Tue Dec 19 02:41:12 CET 2023

% sudo hd /media/ladica/backup/borg/nonce
00000000  7b 22 65 78 63 6c 75 73  69 76 65 22 3a 20 5b 5d  |{"exclusive": []|
00000010


Any ideas how to debug this further? Other commands/flags to run with, or some manual logging code to add somewhere? This repeat of the same issue might indicate some software bug/race condition somewhere, as it seems to me quite unlikely that repeating hardware issue would result exactly in same content ending up in the nonce file.

mnalis avatar Dec 19 '23 02:12 mnalis

Oh, it gets more weird. :-)

Could you copy the repo to a different disk / different enclosure and reproduce with that?

ThomasWaldmann avatar Dec 19 '23 09:12 ThomasWaldmann

BTW, SaveFile internally uses SyncFile which does flush and os.fdatasync (and falls back to os.fsync if platform has no os.fdatasync) internally before closing the file.

ThomasWaldmann avatar Dec 19 '23 09:12 ThomasWaldmann

One approach to debug this would be to add a logger.debug() to the locking code. IIRC it is the lock.roster file that has this json content.

ThomasWaldmann avatar Dec 19 '23 16:12 ThomasWaldmann

Guess the filesystem confusion seen here has nothing to do with borg, so we can't do anything about it.

But, what we can do is to produce a better error message in such a case.

ThomasWaldmann avatar Jan 16 '24 22:01 ThomasWaldmann

Guess the filesystem confusion seen here has nothing to do with borg, so we can't do anything about it.

Yeah. Unfortunately (Unfortunately for sheding some light on the issue, fortunately for me otherwise :smiley: ) I wasn't able to reproduce the bug in some dozen attempts in the meantime, so could not try to debug it either . While I agree that this part of the borg code looks fine; I still find it somewhat suspicious that bug repeated twice is such specific way with same content in wrong place.

But as you say, that might quite well be filesystem bug too, or python bug, or HDD firmware bug or something else. If it ever repeats and I get it to repeat at least somewhat reliably, I'll try to reproduce and let you know. Until such time, feel free to close this issue (when related things are done)

But, what we can do is to produce a better error message in such a case.

That would be nice, yes

mnalis avatar Jan 17 '24 01:01 mnalis

done.

ThomasWaldmann avatar Feb 20 '24 15:02 ThomasWaldmann