corruption of repository nonce during a "borg create" run? (binascii.Error: Non-hexadecimal digit found)
Have you checked borgbackup docs, FAQ, and open GitHub issues?
Yes. There are related issues https://github.com/borgbackup/borg/issues/6031 and https://github.com/borgbackup/borg/issues/7266, and a FAQ entry, "How important are the nonce files?", suggests a course of action.
Is this a BUG / ISSUE report or a QUESTION?
bug
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
borg 1.2.4
Operating system (distribution) and version.
Debian Bookworm GNU/Linux
Hardware / network configuration, and filesystems used.
Dell i5 laptop with 16 GB RAM; source data on NVMe SSDs, destination an external 3.5 TB USB HDD. Ext4 on both the source data and the destination borg repository.
How much data is handled by borg?
About 1 TB in the external HDD repository.
Full borg commandline that led to the problem (leave away excludes and passwords)
sudo borg create --stats --progress --noflags --noacls --noxattrs --compression zstd --exclude-caches --exclude-from /etc/exclude.txt /media/ladica/backup/borg::{hostname}-{now} /
Describe the problem you're observing.
The backup seems to have started normally, then after some time it fails, apparently due to a corrupted "nonce". There seems to have been one transient HDD read error, but it did not repeat. Unmounting the external HDD and force-fsck-ing the filesystem showed only a minor off-by-one free block count. The laptop has not been running kernel 6.1.64 with the ext4 bug at any time (nor was that external HDD used on other systems with such a kernel). Shorter tests do not indicate memory or CPU issues (nor does regular use of the system exhibit any other problems).
What is interesting is that the borg backup seems to have backed up over 4 GB of new data (after going through over 980K files) before bailing out with that error. The nonce value seems corrupted with some JSON-like text:
% sudo ls -l /media/ladica/backup/borg/nonce
[sudo] password for mnalis:
-rw------- 1 root root 16 Dec 16 01:23 /media/ladica/backup/borg/nonce
% sudo hd /media/ladica/backup/borg/nonce
00000000 7b 22 65 78 63 6c 75 73 69 76 65 22 3a 20 5b 5b |{"exclusive": [[|
00000010
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
I have not encountered this specific problem in the past, so it is likely not easily reproducible. I have stopped operating on the drive, in order not to introduce further damage and to collect any more data that might be useful for troubleshooting.
Running sudo time borg check --last 2 /media/ladica/backup/borg fails with the same binascii.Error: Non-hexadecimal digit found error too (but I have not run a repair).
It is my understanding from reading that FAQ entry that just removing /media/ladica/backup/borg/nonce should allow backups to continue. However, it seems interesting that the bug manifested in the middle of a backup, and with such a JSON-like value. The documentation mentions that borg takes extra care so that smaller hardware errors do not cause such corruption; one would therefore not expect a one-time HDD read error (recoverable or not) to lead to such a result?
So:
- Do you want me to try something else in order to gather more data, or should I proceed with removing the nonce and continuing backups?
- ~/.config/borg/security/*/nonce files seem to contain normal hexadecimal values, although they all seem more than a year old, even though I've made backups in that repository a few weeks back?! There are several newer directories under ~/.config/borg/security, but they do not contain files named nonce. Is that to be expected, and will it affect future backups?
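For comparison - if I'm reading repository.py correctly - a healthy repo nonce file holds exactly such a hex value: an 8-byte big-endian counter encoded as 16 ASCII characters, coincidentally the same size as the corrupted JSON-like content above. A tiny round-trip sketch of what I mean (my assumption, not borg code):

```python
from binascii import hexlify, unhexlify

# What the nonce file apparently normally holds: a hex-encoded 8-byte
# big-endian counter (16 ASCII characters), not JSON.
free_nonce = 0x2A
data = hexlify(free_nonce.to_bytes(8, "big"))   # b'000000000000002a'
assert int.from_bytes(unhexlify(data), byteorder="big") == free_nonce
```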
Include any warnings/errors/backtraces from the system logs
fsck:
% sudo fsck /dev/sda2 -C0 -r -f
fsck from util-linux 2.38.1
e2fsck 1.47.0 (5-Feb-2023)
Setting free blocks count to 694077103 (was 694077104)
ts_ladica34: clean, 2802/237633536 files, 256454476/950531579 blocks
/dev/sda2: status 0, rss 5328, real 1.010703, user 0.459906, sys 0.004034
dmesg notes one read error about half a minute after mounting the external HDD:
[180509.470199] usb 4-1: new SuperSpeed USB device number 3 using xhci_hcd
[180509.491375] usb 4-1: New USB device found, idVendor=1058, idProduct=2621, bcdDevice=10.34
[180509.491384] usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[180509.491388] usb 4-1: Product: Elements 2621
[180509.491391] usb 4-1: Manufacturer: Western Digital
[180509.491394] usb 4-1: SerialNumber: 575836324441315045544836
[180509.493221] usb-storage 4-1:1.0: USB Mass Storage device detected
[180509.493585] scsi host0: usb-storage 4-1:1.0
[180510.503374] scsi 0:0:0:0: Direct-Access WD Elements 2621 1034 PQ: 0 ANSI: 6
[180510.503784] sd 0:0:0:0: Attached scsi generic sg0 type 0
[180510.505170] sd 0:0:0:0: [sda] Spinning up disk...
[180511.526112] .........ready
[180519.719108] sd 0:0:0:0: [sda] Very big device. Trying to use READ CAPACITY(16).
[180519.719346] sd 0:0:0:0: [sda] 7813969920 512-byte logical blocks: (4.00 TB/3.64 TiB)
[180519.719349] sd 0:0:0:0: [sda] 4096-byte physical blocks
[180519.719700] sd 0:0:0:0: [sda] Write Protect is off
[180519.719708] sd 0:0:0:0: [sda] Mode Sense: 47 00 10 08
[180519.720058] sd 0:0:0:0: [sda] No Caching mode page found
[180519.720067] sd 0:0:0:0: [sda] Assuming drive cache: write through
[180519.919029] sda: sda1 sda2
[180519.919625] sd 0:0:0:0: [sda] Attached SCSI disk
[180521.459664] EXT4-fs (sda2): mounted filesystem with ordered data mode. Quota mode: none.
[180553.197059] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[180553.197065] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
[180553.197066] sd 0:0:0:0: [sda] tag#0 Add. Sense: Information unit iuCRC error detected
[180553.197068] sd 0:0:0:0: [sda] tag#0 CDB: Read(16) 88 00 00 00 00 01 76 00 08 00 00 00 00 08 00 00
[180553.197069] I/O error, dev sda, sector 6274680832 op 0x0:(READ) flags 0x83700 phys_seg 1 prio class 2
borg error:
% sudo borg create --stats --progress --noflags --noacls --noxattrs --compression zstd --exclude-caches --exclude-from /etc/exclude.txt /media/ladica/backup/borg::{hostname}-{now} /
Enter passphrase for key /media/ladica/backup/borg:
Local Exception.32 GB C 4.26 GB D 980645 N home/mnalis/Videos/s/200 Incredible Moments Caught On Camera ! Best Of The Month [lFTldpIEnV0].webm
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5213, in main
exit_code = archiver.run(args)
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5144, in run
return set_ec(func(args))
^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 183, in wrapper
return method(self, args, repository=repository, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 675, in do_create
create_inner(archive, cache, fso)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 608, in create_inner
self._rec_walk(path=path, parent_fd=parent_fd, name=name,
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 823, in _rec_walk
self._rec_walk(
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 823, in _rec_walk
self._rec_walk(
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 823, in _rec_walk
self._rec_walk(
[Previous line repeated 2 more times]
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 786, in _rec_walk
status = self._process_any(path=path, parent_fd=parent_fd, name=name, st=st, fso=fso, cache=cache,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 688, in _process_any
return fso.process_file(path=path, parent_fd=parent_fd, name=name, st=st, cache=cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archive.py", line 1451, in process_file
self.process_file_chunks(item, cache, self.stats, self.show_progress, backup_io_iter(self.chunker.chunkify(None, fd)))
File "/usr/lib/python3/dist-packages/borg/archive.py", line 1273, in process_file_chunks
item.chunks.append(chunk_processor(chunk))
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archive.py", line 1261, in chunk_processor
chunk_entry = cache.add_chunk(chunk_id, data, stats, wait=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/cache.py", line 949, in add_chunk
data = self.key.encrypt(chunk)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/crypto/key.py", line 370, in encrypt
next_iv = self.nonce_manager.ensure_reservation(self.cipher.next_iv(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/crypto/nonces.py", line 79, in ensure_reservation
repo_free_nonce = self.get_repo_free_nonce()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/crypto/nonces.py", line 38, in get_repo_free_nonce
return self.repository.get_free_nonce()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/repository.py", line 352, in get_free_nonce
return int.from_bytes(unhexlify(fd.read()), byteorder='big')
^^^^^^^^^^^^^^^^^^^^
binascii.Error: Non-hexadecimal digit found
Platform: Linux leia 6.1.0-15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09) x86_64
Linux: Unknown Linux
Borg: 1.2.4 Python: CPython 3.11.2 msgpack: 1.0.3 fuse: llfuse 1.4.1 [pyfuse3,llfuse]
PID: 31422 CWD: /home/mnalis
sys.argv: ['/usr/bin/borg', 'create', '--stats', '--progress', '--noflags', '--noacls', '--noxattrs', '--compression', 'zstd', '--exclude-caches', '--exclude-from', '/etc/exclude.txt', '/media/ladica/backup/borg::{hostname}-{now}', '/']
SSH_ORIGINAL_COMMAND: None
%
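The failing parse at the bottom of that traceback is easy to reproduce standalone, with nothing but the stdlib and the 16 bytes found in the nonce file:

```python
from binascii import Error, unhexlify

data = b'{"exclusive": [['   # the content found in the repo's nonce file
try:
    int.from_bytes(unhexlify(data), byteorder="big")
except Error as e:
    print(e)   # -> Non-hexadecimal digit found
```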
Hmm, that does not look like a borg error, but rather like the repo disk's filesystem or hardware was confused somehow.
Borg never writes such JSON content to the nonce file - that content rather looks like it came from a lock file.
You already ran fsck -f, so maybe also do a smartctl -t long on that disk and afterwards check the SMART status again using smartctl -a (you can post the output here if unsure).
Other than that, yes, I'd just remove the defective nonce file from the repo (it's only causing trouble as it is now) and retry.
Thanks; the long self-test seems to have finished without any problems, and the SMART results do not look problematic to me:
% sudo smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-15-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Elements / My Passport (USB, AF)
Device Model: WDC WD40NDZW-11BCSS0
Serial Number: WD-WX62DA1PETH6
LU WWN Device Id: 5 0014ee 26a079696
Firmware Version: 01.01A01
User Capacity: 4.000.753.475.584 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 4800 rpm
Form Factor: 2.5 inches
TRIM Command: Available, deterministic
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Dec 18 03:44:27 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x04) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (10320) seconds.
Offline data collection
capabilities: (0x1b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 28) minutes.
SCT capabilities: (0x30b5) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 2058
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 266
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 134
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 241
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 200
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 1268
194 Temperature_Celsius 0x0022 119 093 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 133 -
# 2 Short offline Completed without error 00% 128 -
# 3 Extended offline Completed without error 00% 128 -
Selective Self-tests/Logging not supported
%
I've removed the nonce, and at first look new backups seem to work just fine, as does restoring new and old data.
That got me wondering how just that nonce file could get corrupted while the rest of the disk and repository are apparently in good condition. I've looked at commit_nonce_reservation, which I think is responsible for saving the nonce file (note that I'm completely new to the borg codebase and not a Python expert; my background is more in C):
https://github.com/borgbackup/borg/blob/4c131059bc9e90753160c07bf26165278aeab69a/src/borg/repository.py#L365-L369
and it seems that it relies on SaveFile to make sure that the data is safely on disk before continuing.
However, if I'm reading it correctly, SaveFile seems to only do close(2), which does not guarantee that the data is safely on disk? I.e.:
https://github.com/borgbackup/borg/blob/1432f547fe4fe13386b245dcc0d9c2a0880ddfe6/src/borg/platform/base.py#L239-L244
Would it make sense to either (see the sketch below for the pattern I mean):
- explicitly fsync(2) the file data before close(2) in SaveFile.__exit__?
- explicitly fsync(2) the file data in commit_nonce_reservation after write(2) (if the above would be a performance killer in other cases where it is unneeded - nonce is not being written all that frequently if I understand correctly, so it should be OK to do it just there)?
- I notice that there is also class SyncFile, which perhaps might be used for the nonce file (the function descriptions do not seem to explain when one is supposed to use one or the other)?
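To illustrate the first option, a minimal sketch of the write/fdatasync/rename/fsync pattern I have in mind (illustration only; the names are mine, not borg's actual SaveFile):

```python
import os

def atomic_save(path: str, data: bytes) -> None:
    """Hypothetical sketch: write to a temp file, push the data to disk
    before closing, atomically rename over the target, then fsync the
    directory so the rename itself is durable."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fdatasync(fd)   # data safely on disk before close(2)/rename(2)
    finally:
        os.close(fd)
    os.rename(tmp, path)   # atomic replace on POSIX filesystems
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)    # persist the rename itself
    finally:
        os.close(dirfd)
```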
Need to have a deeper look there.
But I guess a missing fsync does not explain that it showed the data from a completely different file.
> Need to have a deeper look there.
Thanks, I've split that fsync(2) suggestion into a separate issue: https://github.com/borgbackup/borg/issues/7969
> But I guess a missing fsync does not explain that it showed the data from a completely different file.
Yeah, not easily explainable. There could be interactions between write reordering and an unclean shutdown that lead to something like that, but that explanation for this particular issue has suddenly become much less probable - the issue has just reproduced with exactly the same error.
Timeline:
- After the initial problem reported here, I did fsck -f and smartctl -t long to make sure there were no physical/logical issues on that HDD.
- Then I removed the nonce file and successfully ran a new borg create, which didn't complain and seemed to complete just fine (didn't keep logs of it, though, as I didn't expect this to repeat).
- I ran a (many hours long) borg check -v -p --repair /media/ladica/backup/borg, and there seems to have been one file with "Previously missing file chunk is still missing" errors. The check seemed to complete OK:
% (date; sudo time borg check -v -p --repair /media/ladica/backup/borg ; date) | tee ~/tmp/borg-repair-20231218b.txt
Mon Dec 18 15:59:53 CET 2023
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!
Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 2788
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Enter passphrase for key /media/ladica/backup/borg:
Analyzing archive leia-2022-11-28T23:40:49 (1/100)
Analyzing archive leia-2022-11-29T01:34:42 (2/100)
Analyzing archive skywalker-2022-11-29T14:19:21 (3/100)
Analyzing archive leia-2022-11-29T18:05:52 (4/100)
Analyzing archive leia-2022-11-29T18:30:56 (5/100)
Analyzing archive leia-2022-12-08T13:10:50 (6/100)
Analyzing archive leia-2022-12-09T01:14:12 (7/100)
Analyzing archive leia-2022-12-09T01:28:19 (8/100)
Analyzing archive leia-2022-12-09T19:56:15 (9/100)
Analyzing archive leia-2022-12-09T20:12:02 (10/100)
Analyzing archive leia-2022-12-27T14:40:11 (11/100)
Analyzing archive leia-2023-01-24T22:59:18.checkpoint (12/100)
Analyzing archive leia-2023-01-24T23:05:06 (13/100)
Analyzing archive leia-2023-01-24T23:16:09 (14/100)
Analyzing archive leia-2023-01-25T03:07:01 (15/100)
Analyzing archive leia-2023-01-25T03:12:02 (16/100)
Analyzing archive leia-2023-01-26T14:27:44 (17/100)
Analyzing archive skywalker-2023-01-26T14:31:55 (18/100)
Analyzing archive skywalker-2023-01-26T14:57:18 (19/100)
Analyzing archive leia-2023-01-30T23:12:06 (20/100)
Analyzing archive torres-20230130-backup-2023-01-30T23:39:12 (21/100)
Analyzing archive leia-2023-01-31T05:24:17 (22/100)
Analyzing archive leia-2023-02-06T13:35:42 (23/100)
Analyzing archive leia-2023-02-13T01:10:18 (24/100)
Analyzing archive leia-2023-02-16T17:58:06 (25/100)
Analyzing archive leia-2023-02-17T06:28:50 (26/100)
Analyzing archive leia-2023-03-20T12:06:09 (27/100)
Analyzing archive HuaweiP30-2023-03-20T12:58:27.checkpoint (28/100)
Analyzing archive HuaweiP30-2023-03-20T13:03:13.checkpoint (29/100)
Analyzing archive HuaweiP30-2023-03-20T13:11:30 (30/100)
Analyzing archive torres-owncloud-2023-03-20T14:28:39 (31/100)
Analyzing archive torres-owncloud-2023-03-20T14:33:21.checkpoint (32/100)
Analyzing archive torres-owncloud-2023-03-20T19:12:30.checkpoint (33/100)
torres-owncloud-2023-03-20T19:12:30.checkpoint: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-owncloud-2023-03-20T19:12:30.checkpoint: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive torres-owncloud-2023-03-20T23:07:53.checkpoint (34/100)
Analyzing archive torres-owncloud-2023-03-21T02:21:30 (35/100)
torres-owncloud-2023-03-21T02:21:30: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-owncloud-2023-03-21T02:21:30: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive torres-owncloud-2023-03-21T13:55:34 (36/100)
Analyzing archive torres-owncloud-2023-03-21T13:57:03 (37/100)
torres-owncloud-2023-03-21T13:57:03: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-owncloud-2023-03-21T13:57:03: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive torres-owncloud-2023-03-22T01:12:11.checkpoint (38/100)
Analyzing archive torres-owncloud-2023-03-22T01:22:32 (39/100)
torres-owncloud-2023-03-22T01:22:32: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-owncloud-2023-03-22T01:22:32: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive torres-2023-03-22T01:31:44 (40/100)
torres-2023-03-22T01:31:44: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 1050941-5836528, Chunk 9789d6da073b872b105ae67dfc26efdbce4af2649816c1cb500eb9429e95007f). It has an all-zero replacement chunk already.
torres-2023-03-22T01:31:44: home/mnalis/mnt/var/lib/owncloud/data/kkulic/files/InstantUpload/VID_20170617_073952.mp4: Previously missing file chunk is still missing (Byte 5836528-7584655, Chunk d3137814b599f7f2cd68375c30c8f561e4edcb81d6412488ef1fe2eeb6bf44b4). It has an all-zero replacement chunk already.
Analyzing archive skywalker-2023-03-23T13:46:42 (41/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-23T22:13:30 (42/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-23T22:14:34 (43/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-23T23:03:59 (44/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-23T23:07:45 (45/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-24T00:32:48 (46/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-24T01:54:45.checkpoint (47/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-24T01:57:19 (48/100)
Analyzing archive XiaomiRedmiNote6Pro-2023-03-24T02:04:04 (49/100)
Analyzing archive leia-2023-03-24T03:05:55 (50/100)
Analyzing archive leia-2023-04-04T16:54:39 (51/100)
Analyzing archive leia-2023-04-18T04:12:09 (52/100)
Analyzing archive leia-2023-05-26T15:15:48 (53/100)
Analyzing archive HuaweiP30-2023-05-26T15:25:10.checkpoint (54/100)
Analyzing archive HuaweiP30-2023-05-26T15:28:22 (55/100)
Analyzing archive HuaweiP30-2023-05-26T16:18:30 (56/100)
Analyzing archive skywalker-2023-05-26T16:37:13 (57/100)
Analyzing archive leia-2023-06-24T16:33:16 (58/100)
Analyzing archive HuaweiP30-2023-06-24T16:44:22.checkpoint (59/100)
Analyzing archive HuaweiP30-2023-06-24T17:13:04 (60/100)
Analyzing archive HuaweiP30-2023-06-24T17:14:25 (61/100)
Analyzing archive HuaweiP30-2023-06-24T18:28:13 (62/100)
Analyzing archive leia-2023-08-27T20:03:19 (63/100)
Analyzing archive HuaweiP30-2023-08-27T20:16:53.checkpoint (64/100)
Analyzing archive HuaweiP30-2023-08-27T23:16:10.checkpoint (65/100)
Analyzing archive HuaweiP30-2023-08-28T15:53:26 (66/100)
Analyzing archive HuaweiP30-2023-08-28T16:52:36 (67/100)
Analyzing archive HuaweiP30-2023-08-28T17:07:45 (68/100)
Analyzing archive HuaweiP30-2023-08-28T17:15:06 (69/100)
Analyzing archive skywalker-2023-08-28T17:57:29 (70/100)
Analyzing archive leia-2023-09-25T17:15:40 (71/100)
Analyzing archive skywalker-2023-10-01T11:19:00 (72/100)
Analyzing archive ivana-stari-acer-laptop-2023-10-01T15:39:47 (73/100)
Analyzing archive HuaweiP30-2023-10-01T23:54:59 (74/100)
Analyzing archive HuaweiP30-2023-10-02T02:50:38 (75/100)
Analyzing archive leia-2023-10-02T04:20:24.checkpoint (76/100)
Analyzing archive leia-2023-10-02T16:06:55 (77/100)
Analyzing archive Xiaomi_SD64-2023-10-02T16:29:54.checkpoint (78/100)
Analyzing archive XiaomiRedmiNote6Pro_SD64-2023-10-02T16:31:02 (79/100)
Analyzing archive XiaomiRedmiNote6Pro_SD64-2023-10-02T16:31:24 (80/100)
Analyzing archive leia-2023-10-02T18:08:01 (81/100)
Analyzing archive XiaomiRedmiNote6Pro_SD64-2023-10-02T18:21:07 (82/100)
Analyzing archive XiaomiRedmiNote6Pro_internal1-2023-10-02T23:19:39 (83/100)
Analyzing archive XiaomiRedmiNote6Pro_internal2-2023-10-02T23:24:03 (84/100)
Analyzing archive leia-2023-10-08T01:36:16 (85/100)
Analyzing archive leia-2023-10-27T11:29:29 (86/100)
Analyzing archive HuaweiP30-2023-10-27T11:43:05 (87/100)
Analyzing archive HuaweiP30-2023-10-27T12:54:38 (88/100)
Analyzing archive HuaweiP30-2023-10-27T13:28:21 (89/100)
Analyzing archive HuaweiP30-2023-10-27T13:32:13 (90/100)
Analyzing archive leia-2023-11-14T13:03:32 (91/100)
Analyzing archive leia-2023-12-02T08:18:10 (92/100)
Analyzing archive leia-2023-12-02T08:49:52 (93/100)
Analyzing archive leia-2023-12-02T09:13:39 (94/100)
Analyzing archive leia-2023-12-05T21:58:52 (95/100)
Analyzing archive leia-2023-12-11T21:31:49 (96/100)
Analyzing archive leia-2023-12-11T22:09:33 (97/100)
Analyzing archive leia-2023-12-13T22:57:25 (98/100)
Analyzing archive leia-2023-12-17T21:37:45 (99/100)
Analyzing archive leia-2023-12-17T21:50:01 (100/100)
Deleting 0 orphaned and 194 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, no problems found.
1171.22user 657.82system 3:14:47elapsed 15%CPU (0avgtext+0avgdata 397012maxresident)k
1932064520inputs+487960outputs (60major+704024minor)pagefaults 0swaps
Mon Dec 18 19:14:41 CET 2023
- Just to be sure, I followed it with another check, this time with --verify-data: borg check -v -p --verify-data /media/ladica/backup/borg. It also worked for many hours, and then bailed out with the same nonce corruption -- even the content looks like the same JSON as before.
% (date; sudo time borg check -v -p --verify-data /media/ladica/backup/borg ; date) | tee -a ~/tmp/borg-repair-20231218c.txt
[sudo] password for mnalis: Mon Dec 18 19:20:38 CET 2023
Starting repository check
finished segment check at segment 2792
Starting repository index check
Index object count match.
Finished full repository check, no problems found.
Starting archive consistency check...
Enter passphrase for key /media/ladica/backup/borg:
Starting cryptographic data integrity verification...
Finished cryptographic data integrity verification, verified 2258954 chunks with 0 integrity errors.
Analyzing archive leia-2022-11-28T23:40:49 (1/100)
Local Exception
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5213, in main
exit_code = archiver.run(args)
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 5144, in run
return set_ec(func(args))
^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 183, in wrapper
return method(self, args, repository=repository, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 347, in do_check
if not args.repo_only and not ArchiveChecker().check(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archive.py", line 1647, in check
self.rebuild_refcounts(archive=archive, first=first, last=last, sort_by=sort_by, glob=glob)
File "/usr/lib/python3/dist-packages/borg/archive.py", line 2059, in rebuild_refcounts
items_buffer.add(item)
File "/usr/lib/python3/dist-packages/borg/archive.py", line 357, in add
self.flush()
File "/usr/lib/python3/dist-packages/borg/archive.py", line 385, in flush
self.chunks.append(self.write_chunk(chunk))
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/archive.py", line 1846, in add_callback
cdata = self.key.encrypt(chunk)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/crypto/key.py", line 370, in encrypt
next_iv = self.nonce_manager.ensure_reservation(self.cipher.next_iv(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/crypto/nonces.py", line 79, in ensure_reservation
repo_free_nonce = self.get_repo_free_nonce()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/crypto/nonces.py", line 38, in get_repo_free_nonce
return self.repository.get_free_nonce()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/borg/repository.py", line 352, in get_free_nonce
return int.from_bytes(unhexlify(fd.read()), byteorder='big')
^^^^^^^^^^^^^^^^^^^^
binascii.Error: Non-hexadecimal digit found
Platform: Linux leia 6.1.0-15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09) x86_64
Linux: Unknown Linux
Borg: 1.2.4 Python: CPython 3.11.2 msgpack: 1.0.3 fuse: llfuse 1.4.1 [pyfuse3,llfuse]
PID: 32758 CWD: /home/mnalis
sys.argv: ['/usr/bin/borg', 'check', '-v', '-p', '--verify-data', '/media/ladica/backup/borg']
SSH_ORIGINAL_COMMAND: None
Command exited with non-zero status 2
3688.13user 1156.81system 7:20:27elapsed 18%CPU (0avgtext+0avgdata 397468maxresident)k
3845164672inputs+48outputs (35major+148304minor)pagefaults 0swaps
Tue Dec 19 02:41:12 CET 2023
% sudo hd /media/ladica/backup/borg/nonce
00000000 7b 22 65 78 63 6c 75 73 69 76 65 22 3a 20 5b 5d |{"exclusive": []|
00000010
Any ideas how to debug this further? Other commands/flags to run, or some manual logging code to add somewhere? This repeat of the same issue might indicate a software bug/race condition somewhere, as it seems quite unlikely to me that a recurring hardware issue would produce exactly the same content in the nonce file.
Oh, it gets more weird. :-)
Could you copy the repo to a different disk / different enclosure and reproduce with that?
BTW, SaveFile internally uses SyncFile, which does flush and os.fdatasync (falling back to os.fsync if the platform has no os.fdatasync) before closing the file.
One approach to debugging this would be to add a logger.debug() call to the locking code. IIRC it is the lock.roster file that has this JSON content.
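For example, something along these lines (hypothetical placement and names; the real roster-saving code in borg's locking module looks different):

```python
import json
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch of the suggested instrumentation: log exactly what
# is written where, right before the lock roster file is saved.
def save_roster(path, roster):
    data = json.dumps(roster).encode()
    logger.debug("lock roster: writing %d bytes to %s: %r", len(data), path, data)
    with open(path, "wb") as fd:
        fd.write(data)
```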
Guess the filesystem confusion seen here has nothing to do with borg, so we can't do anything about it.
But, what we can do is to produce a better error message in such a case.
> Guess the filesystem confusion seen here has nothing to do with borg, so we can't do anything about it.
Yeah. Unfortunately (unfortunately for shedding some light on the issue, fortunately for me otherwise :smiley:) I wasn't able to reproduce the bug in some dozen attempts in the meantime, so I could not try to debug it either. While I agree that this part of the borg code looks fine, I still find it somewhat suspicious that the bug repeated twice in such a specific way, with the same content in the wrong place.
But as you say, it might well be a filesystem bug, a Python bug, an HDD firmware bug, or something else. If it ever repeats and I can get it to happen at least somewhat reliably, I'll try to debug it and let you know. Until then, feel free to close this issue (when the related things are done).
> But, what we can do is to produce a better error message in such a case.
That would be nice, yes.
done.