borg icon indicating copy to clipboard operation
borg copied to clipboard

Crash during prune due to corrupted item metadata

Open omgold opened this issue 8 months ago • 10 comments

Since a couple of days I got a crash with the exception below when I try to prune my repo. No idea what changed (It already started with borg 1.4.0).

I added a debug print to see what is in the item Python complains about, which produces the first line of output below. Seems the key group got messed up somehow.

{
b'chunks': ((b'Cp\xb6)JxY\xdf\xeb\xdbX\x05\xde\x1f]M\rp\xb0\xeb\xaf\xb6\xcf\x8e\xc5\x18\x870\xae"\xeb\x10', 1223, 1266),),
b'ctime': 1741265229576867330,
b'gid': 1000,
b'gro\xf5p': b'omgold',         #  <--- key corrupted
b'hardlink_master': False,
b'mode': 33060,
b'mtime': 1741265229576867330,
b'path': b'home/omgold/kernel/.git/objects/9c/58c0ac1a932ec9b2a7399ceb9361f7a658e376',
b'size': 1223,
b'uid': 1000,
b'user': b'omgold'
}

Local Exception
Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/borg/archiver.py", line 5465, in main
    exit_code = archiver.run(args)
  File "/usr/lib/python3.13/site-packages/borg/archiver.py", line 5383, in run
    rc = func(args)
  File "/usr/lib/python3.13/site-packages/borg/archiver.py", line 200, in wrapper
    return method(self, args, repository=repository, **kwargs)
  File "/usr/lib/python3.13/site-packages/borg/archiver.py", line 1607, in do_prune
    archive.delete(stats, forced=args.forced)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/site-packages/borg/archive.py", line 1010, in delete
    item = Item(internal_dict=item)
  File "src/borg/item.pyx", line 48, in borg.item.PropDict.__init__
  File "src/borg/item.pyx", line 69, in borg.item.PropDict.update_internal
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf5 in position 3: invalid start byte

Platform: Linux home 6.12.16-1-lts #1 SMP PREEMPT_DYNAMIC Fri, 21 Feb 2025 19:20:31 +0000 x86_64
Linux: Unknown Linux  
Borg: 1.4.1  Python: CPython 3.13.3 msgpack: 1.0.5 fuse: llfuse 1.5.0 [pyfuse3,llfuse]
PID: 3311491  CWD: /home/omgold
sys.argv: ['/usr/bin/borg', 'prune', '--info', '--keep-hourly=176', '--keep-daily=30', '--keep-weekly=50', '--keep-monthly=60', '--keep-yearly=100', 'borg@backup:home']
SSH_ORIGINAL_COMMAND: None

omgold avatar Apr 22 '25 07:04 omgold

This looks like a RAM issue, please run at least 1 full pass of memtest86+ on your borg client machine.

>>> hex(int('u'.encode()[0]))
'0x75'

So, that is a single bit flipped to make 0xf5 from 0x75.

Could be also CPU or mainboard I guess.

Also please check /etc/group file how the group name for gid 1000 looks like.

ThomasWaldmann avatar Apr 22 '25 11:04 ThomasWaldmann

Running memtest is a bit inconvenient right now, but running mprime (which in my experience is a much more reliable test anyway) doesn't show any issues.

The entry in /etc/group looks fine (no weird bytes, checked with hexdump).

Note that the situation is reproducible. I run borg hourly and it fails every time for 3 days.

omgold avatar Apr 22 '25 12:04 omgold

Isn't mprime rather a CPU load test? My suspicion is that your RAM might be unreliable.

Of course it could also be that the RAM is ok and working normally. But working normally includes rare bitflips if you don't have a ECC memory system (RAM + wires + CPU ECC support).

If you suspect it is a borg problem and we want to rule out a hw issue, please reproduce the issue on another machine.

ThomasWaldmann avatar Apr 22 '25 13:04 ThomasWaldmann

Also, please note that the reason why it is reproducible could well be because your repo already contains corrupted metadata (so the corruption event was at borg create time and persisted in the archived item metadata). So, each time you try to process that archive (that item), borg will blow up with that exception.

The easiest way to get rid of that would be to delete the affected archive.

After making sure your hw works correctly, running a borg check / borg check --repair is a good idea.

ThomasWaldmann avatar Apr 22 '25 13:04 ThomasWaldmann

Isn't mprime rather a CPU load test? My suspicion is that your RAM might be unreliable.

mprime absolutely is a memory test (or at least works great as one). I always use it for that purpose. You can actually test it by messing with your memory timings until you have a slight instability (such that your system boots fine). Chances are that mprime catches the issue in orders of magnitude less time than memtest86.

Also, please note that the reason why it is reproducible could well be because your repo already contains corrupted metadata (so the corruption event was at borg create time and persisted in the archived item metadata). So, each time you try to process that archive (that item), borg will blow up with that exception.

That would make sense to me. See below.

The easiest way to get rid of that would be to delete the affected archive.

Okay, but how do I find out which archive is the affected one?

After making sure your hw works correctly, running a borg check / borg check --repair is a good idea.

borg check indeed shows a problem here:

$ /usr/bin/borg check borg@backup:home
Did not get expected metadata dict when unpacking item metadata (invalid keys: chu�ks) [chunk: 000313_33cd277c72f86f404b6af96b69d6001dd8055d2de1ab3b3a22db0a0420348a2d]
Did not get expected metadata dict when unpacking item metadata (invalid keys: m�de) [chunk: 000416_e33a2ae5dc13ce5e9647a96ad170120ae6bdf3ec7515f50ba97b2aca5fc69260]
Did not get expected metadata dict when unpacking item metadata (invalid keys: gro�p) [chunk: 000213_c09aee27ef9bb4bee572ddccb82e3863e3373f4f3599ce55512d1c8aaa0b9d6b]
Did not get expected metadata dict when unpacking item metadata (missing required keys: mtime) [chunk: 000366_7b2942e4999a3f6153c4589d50a9761a316673d1e0ed07b5bded8d4fd775b742]
Did not get expected metadata dict when unpacking item metadata (invalid keys: grou�) [chunk: 000420_3bc6c2f9318743e87611bd46c25d9e86e2c2b432c88028bd05c91f05b4676812]

omgold avatar Apr 22 '25 14:04 omgold

Quite some corruption there, so it seems more than the normal, rare, non-ECC RAM bitflip.

Guess I would:

  • fix the hw issue
  • run borg check --verify-data --repair -v ...

ThomasWaldmann avatar Apr 22 '25 18:04 ThomasWaldmann

Yesterday I rolled back the server state (from a snapshot) to 3-day-old state, which borg check didn't complain about. The next three backups worked fine, then the problem appeared again...

omgold avatar Apr 23 '25 05:04 omgold

If you don't fix the hw, issues will reappear.

ThomasWaldmann avatar Apr 23 '25 16:04 ThomasWaldmann

After the problem reappeared, I ran borg check --repair. Until now it is quiet.

If you don't fix the hw, issues will reappear.

I don't understand how this could be a hardware issue. The same bitflip at the same position even after a reboot (and new borg process of course)?

A one-time transient glitch that corrupted the data and made it persistent on disk, yes. Maybe the repair solved it. Let's see...

omgold avatar Apr 24 '25 05:04 omgold

If you do a lot of activity on that machine while memory is not working correctly, you risk all sorts of malfunctions and even persistent corruption, not just in the borg repo.

ThomasWaldmann avatar Apr 24 '25 09:04 ThomasWaldmann