Crash during prune due to corrupted item metadata
For the past couple of days I have been getting a crash with the exception below when I try to prune my repo. No idea what changed (it already started with borg 1.4.0).
I added a debug print to see what is in the item Python complains about, which produces the item dump at the top of the output below. It seems the key `group` got messed up somehow.
{
b'chunks': ((b'Cp\xb6)JxY\xdf\xeb\xdbX\x05\xde\x1f]M\rp\xb0\xeb\xaf\xb6\xcf\x8e\xc5\x18\x870\xae"\xeb\x10', 1223, 1266),),
b'ctime': 1741265229576867330,
b'gid': 1000,
b'gro\xf5p': b'omgold', # <--- key corrupted
b'hardlink_master': False,
b'mode': 33060,
b'mtime': 1741265229576867330,
b'path': b'home/omgold/kernel/.git/objects/9c/58c0ac1a932ec9b2a7399ceb9361f7a658e376',
b'size': 1223,
b'uid': 1000,
b'user': b'omgold'
}
Local Exception
Traceback (most recent call last):
File "/usr/lib/python3.13/site-packages/borg/archiver.py", line 5465, in main
exit_code = archiver.run(args)
File "/usr/lib/python3.13/site-packages/borg/archiver.py", line 5383, in run
rc = func(args)
File "/usr/lib/python3.13/site-packages/borg/archiver.py", line 200, in wrapper
return method(self, args, repository=repository, **kwargs)
File "/usr/lib/python3.13/site-packages/borg/archiver.py", line 1607, in do_prune
archive.delete(stats, forced=args.forced)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/borg/archive.py", line 1010, in delete
item = Item(internal_dict=item)
File "src/borg/item.pyx", line 48, in borg.item.PropDict.__init__
File "src/borg/item.pyx", line 69, in borg.item.PropDict.update_internal
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf5 in position 3: invalid start byte
Platform: Linux home 6.12.16-1-lts #1 SMP PREEMPT_DYNAMIC Fri, 21 Feb 2025 19:20:31 +0000 x86_64
Linux: Unknown Linux
Borg: 1.4.1 Python: CPython 3.13.3 msgpack: 1.0.5 fuse: llfuse 1.5.0 [pyfuse3,llfuse]
PID: 3311491 CWD: /home/omgold
sys.argv: ['/usr/bin/borg', 'prune', '--info', '--keep-hourly=176', '--keep-daily=30', '--keep-weekly=50', '--keep-monthly=60', '--keep-yearly=100', 'borg@backup:home']
SSH_ORIGINAL_COMMAND: None
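For what it's worth, the failing decode can be reproduced standalone with just the corrupted key from the item dump above (a minimal sketch, nothing borg-specific in it):

# Decoding the corrupted key the same way borg's PropDict does (as UTF-8):
key = b'gro\xf5p'
key.decode('utf-8')   # UnicodeDecodeError: can't decode byte 0xf5 in position 3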
This looks like a RAM issue; please run at least one full pass of memtest86+ on your borg client machine.
>>> hex(int('u'.encode()[0]))
'0x75'
So, that is a single bit flipped to make 0xf5 from 0x75.
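To make the single-bit claim explicit, a quick check (nothing borg-specific):

# 0x75 ('u') and 0xf5 differ in exactly one bit, the most significant one:
diff = ord('u') ^ 0xf5
print(hex(diff), bin(diff).count('1'))   # -> 0x80 1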
Could also be the CPU or the mainboard, I guess.
Also, please check in the /etc/group file what the group name for gid 1000 looks like.
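For example, something like this prints the raw bytes of that entry (just a sketch; hexdump on /etc/group works just as well):

# Read /etc/group in binary mode and show the raw bytes of the gid 1000 group name:
with open('/etc/group', 'rb') as f:
    for line in f:
        fields = line.rstrip(b'\n').split(b':')   # name:password:gid:members
        if len(fields) >= 3 and fields[2] == b'1000':
            print(fields[0], fields[0].hex())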
Running memtest is a bit inconvenient right now, but running mprime (which in my experience is a much more reliable test anyway) doesn't show any issues.
The entry in /etc/group looks fine (no weird bytes, checked with hexdump).
Note that the situation is reproducible. I run borg hourly and it has failed every time for the last 3 days.
Isn't mprime rather a CPU load test? My suspicion is that your RAM might be unreliable.
Of course it could also be that the RAM is OK and working normally. But working normally includes rare bitflips if you don't have an ECC memory system (RAM + wires + CPU ECC support).
If you suspect it is a borg problem and we want to rule out a hw issue, please reproduce the issue on another machine.
Also, please note that the reason why it is reproducible could well be because your repo already contains corrupted metadata (so the corruption event was at borg create time and persisted in the archived item metadata). So, each time you try to process that archive (that item), borg will blow up with that exception.
The easiest way to get rid of that would be to delete the affected archive.
After making sure your hw works correctly, running a borg check / borg check --repair is a good idea.
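To illustrate why that would be deterministic, here is a rough simulation (not borg's actual code, just msgpack-ed item metadata with one flipped bit):

import msgpack

# Item metadata roughly as it would have been packed at borg create time:
item = {b'group': b'omgold', b'uid': 1000}
packed = bytearray(msgpack.packb(item))

# Simulate a one-time transient bitflip (0x75 -> 0xf5) before the chunk was written:
pos = packed.index(b'group') + 3      # the 'u' inside 'group'
packed[pos] ^= 0x80                   # a single flipped bit, now persisted in the repo

# From now on, every read of that item fails the same way, on every machine:
for key in msgpack.unpackb(bytes(packed)):
    key.decode('utf-8')               # UnicodeDecodeError on b'gro\xf5p'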
Isn't mprime rather a CPU load test? My suspicion is that your RAM might be unreliable.
mprime absolutely is a memory test (or at least works great as one). I always use it for that purpose. You can actually test it by messing with your memory timings until you have a slight instability (such that your system still boots fine). Chances are that mprime catches the issue in orders of magnitude less time than memtest86.
Also, please note that the reason why it is reproducible could well be because your repo already contains corrupted metadata (so the corruption event was at borg create time and persisted in the archived item metadata). So, each time you try to process that archive (that item), borg will blow up with that exception.
That would make sense to me. See below.
The easiest way to get rid of that would be to delete the affected archive.
Okay, but how do I find out which archive is the affected one?
After making sure your hw works correctly, running a borg check / borg check --repair is a good idea.
borg check indeed shows a problem here:
$ /usr/bin/borg check borg@backup:home
Did not get expected metadata dict when unpacking item metadata (invalid keys: chu�ks) [chunk: 000313_33cd277c72f86f404b6af96b69d6001dd8055d2de1ab3b3a22db0a0420348a2d]
Did not get expected metadata dict when unpacking item metadata (invalid keys: m�de) [chunk: 000416_e33a2ae5dc13ce5e9647a96ad170120ae6bdf3ec7515f50ba97b2aca5fc69260]
Did not get expected metadata dict when unpacking item metadata (invalid keys: gro�p) [chunk: 000213_c09aee27ef9bb4bee572ddccb82e3863e3373f4f3599ce55512d1c8aaa0b9d6b]
Did not get expected metadata dict when unpacking item metadata (missing required keys: mtime) [chunk: 000366_7b2942e4999a3f6153c4589d50a9761a316673d1e0ed07b5bded8d4fd775b742]
Did not get expected metadata dict when unpacking item metadata (invalid keys: grou�) [chunk: 000420_3bc6c2f9318743e87611bd46c25d9e86e2c2b432c88028bd05c91f05b4676812]
Quite some corruption there, so it seems to be more than the normal, rare non-ECC RAM bitflip.
Guess I would:
- fix the hw issue
- run borg check --verify-data --repair -v ...
Yesterday I rolled back the server (from a snapshot) to a 3-day-old state, which borg check didn't complain about. The next three backups worked fine, then the problem appeared again...
If you don't fix the hw, issues will reappear.
After the problem reappeared, I ran borg check --repair. So far it has been quiet.
If you don't fix the hw, issues will reappear.
I don't understand how this could be a hardware issue. The same bitflip at the same position, even after a reboot (and a new borg process, of course)?
Yes: a one-time transient glitch that corrupted the data, which then persisted on disk. Maybe the repair solved it. Let's see...
If you do a lot of activity on that machine while memory is not working correctly, you risk all sorts of malfunctions and even persistent corruption, not just in the borg repo.