borg check errors after upgrade to 1.2.0

Open jdchristensen opened this issue 2 years ago • 96 comments

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes.

Is this a BUG / ISSUE report or a QUESTION?

Possible bug.

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

1.2.0 on all clients and servers. Previous version was 1.1.17 (not 1.1.7, as I wrote on the mailing list).

Operating system (distribution) and version.

Debian buster and Ubuntu 20.04 on servers. Debian buster, Ubuntu 20.04 and Ubuntu 21.10 on clients.

Hardware / network configuration, and filesystems used.

Multiple local and remote clients accessing each repository. Repositories are on ext4, on RAID 1 mdadm devices, with spinning disks underlying them. The Debian server also uses lvm.

How much data is handled by borg?

The repos are all around 100GB in size, with up to 400 archives each. The repositories have been in use for many years.

Full borg command line that led to the problem (leave out excludes and passwords)

borg check /path/to/repo [more details below]

Describe the problem you're observing.

borg check shows errors on three different repositories on two different machines. See below for details.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, borg check shows the same errors when run again.

Include any warning/errors/backtraces from the system logs

I upgraded from borg 1.1.17 to 1.2.0 on several different systems on about April 9. On May 9, my monthly "borg check" runs gave errors on three repositories on two systems. Note that I use the setup where several clients do their backups into the same repositories. I don't have any non-shared repositories for comparison.

At the time of the upgrade from 1.1.17 to 1.2.0, I ran borg compact --cleanup-commits ... followed by borg check ... on all repos. There were no errors then. After that, I run borg compact without --cleanup-commits followed by borg check once per month. The errors occurred at the one month mark.

System 1 runs Ubuntu 20.04. Two of the three repos on this machine now have errors:

# borg check /Backups/borg/home.borg
Index object count mismatch.
committed index: 1166413 objects
rebuilt index:   1166414 objects
ID: 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd
rebuilt index: (18596, 199132336) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/home.borg/data/37
total 1453100
-rw------- 2 bu bu 201259844 Dec  8  2020 18596
-rw------- 2 bu bu 185611530 Dec 12  2020 18651
-rw------- 2 bu bu 125106377 Dec 25  2020 18858
-rw------- 2 bu bu 524318301 Dec 26  2020 18874
-rw------- 2 bu bu 193813842 Dec 30  2020 18940
-rw------- 2 bu bu 116657254 Dec 30  2020 18945
-rw------- 2 bu bu 141181725 Dec 31  2020 18953

# borg check /Backups/borg/system.borg
Index object count mismatch.
committed index: 2324200 objects
rebuilt index:   2324202 objects
ID: 1e20354918f4fdeb9cc0d677c28dffe1a383dd1b0db11ebcbc5ffb809d3c2b8a
rebuilt index: (24666, 60168) committed index: <not found>
ID: d9c516b5bf53f661a1a9d2ada08c8db7c33a331713f23e058cd6969982728157
rebuilt index: (3516, 138963001) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/system.borg/data/49
total 3316136
-rw------- 2 bu bu 500587725 Oct  5  2021 24555
-rw------- 2 bu bu 168824081 Oct  8  2021 24603
-rw------- 2 bu bu 116475028 Oct  9  2021 24619
-rw------- 2 bu bu 107446533 Oct 11  2021 24634
-rw------- 2 bu bu 252958665 Oct 12  2021 24666
-rw------- 2 bu bu 124871243 Oct 19  2021 24777
-rw------- 2 bu bu 277627834 Oct 19  2021 24793
-rw------- 2 bu bu 231932763 Oct 21  2021 24835
-rw------- 2 bu bu 114031902 Oct 22  2021 24847
-rw------- 2 bu bu 127020577 Oct 26  2021 24899
-rw------- 2 bu bu 220293895 Oct 26  2021 24907
-rw------- 2 bu bu 113238393 Oct 27  2021 24933
-rw------- 2 bu bu 525154704 Oct 27  2021 24941
-rw------- 2 bu bu 291472023 Oct 27  2021 24943
-rw------- 2 bu bu 223721033 Oct 30  2021 24987

# ls -l /Backups/borg/system.borg/data/7
total 1200244
-rw------- 2 bu bu 524615544 Feb  4  2018 3516
-rw------- 2 bu bu 145502511 Feb  5  2018 3529
-rw------- 2 bu bu 266037549 Feb 21  2018 3740
-rw------- 2 bu bu 292869056 Mar 14  2018 3951

System 2 runs Debian buster. One of the three repos on this machine now has errors:

# borg check /Backups/borg/system.borg
Index object count mismatch.
committed index: 2052187 objects
rebuilt index:   2052188 objects
ID: 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd
rebuilt index: (946, 15871355) committed index: <not found>
Finished full repository check, errors found.

# ls -l /Backups/borg/system.borg/data/1
total 205308
-rw------- 1 bu bu 210234581 Jun 20  2017 946

I have used borg on these systems for years, and no hardware has changed recently. System 1 has the repos on a RAID 1 mdadm device with two SATA spinning disks. System 2 also has the repos on RAID 1 mdadm devices with two SATA disks, with lvm as a middle layer. In both cases, smartctl shows no issues for any of the drives, and memtester also shows no errors.

Since the errors have happened on different machines within a month of upgrading to 1.2.0, I am concerned that this is a borg issue rather than a hardware issue. It is also suspicious to me that the error is the same in all cases, with a committed index not found. Hardware errors tend to produce garbage.

I have not run repair yet. Is there anything I should do before running repair to try to figure out the issue?

jdchristensen avatar May 09 '22 12:05 jdchristensen

I should also mention that I don't use encryption with any of these repos.

jdchristensen avatar May 09 '22 12:05 jdchristensen

Thanks for the detailed bug report.

ThomasWaldmann avatar May 09 '22 13:05 ThomasWaldmann

BTW, do you use borg break-lock or --bypass-lock option in your automated scripts?

ThomasWaldmann avatar May 09 '22 13:05 ThomasWaldmann

I have never used break-lock or bypass-lock. I updated the top post to show the correct directories.

jdchristensen avatar May 09 '22 13:05 jdchristensen

I have made a copy of one of the repos. Can you tell me exactly how to run the borg debug get-obj command? Where will it put the output?

jdchristensen avatar May 09 '22 13:05 jdchristensen

% borg debug get-obj --help
usage: borg debug get-obj REPOSITORY ID PATH

thus, for the first repo:

borg debug get-obj /Backups/borg/home.borg \
    8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd \
    8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd.chunk

But likely it will only work after you have repaired the repo index.

ThomasWaldmann avatar May 09 '22 13:05 ThomasWaldmann

The directory listings did not reveal anything special. But considering the described circumstances this looks like a bug.

ThomasWaldmann avatar May 09 '22 13:05 ThomasWaldmann

In case we do not find the issue otherwise, maybe keep an unmodified copy of at least one of the repos until this is resolved.

ThomasWaldmann avatar May 09 '22 14:05 ThomasWaldmann

I made a full copy of the third repository above (on the Debian system) and made hard link copies of the first two repositories. I'm now doing repairs on two of the repositories. When they complete, I will include the output, and do the get-obj check.

jdchristensen avatar May 09 '22 14:05 jdchristensen

On the Debian system:

# borg check -v --repair /Backups/borg/system.borg 
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 38108
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Analyzing archive estrela-system-20170729-04:32:29 (1/236)
...
Analyzing archive boots-system-20220509-06:32:25 (236/236)
1 orphaned objects found!
Deleting 1 orphaned and 236 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, problems found.

On the Ubuntu system:

# borg check -v --repair /Backups/borg/home.borg
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 32373
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Analyzing archive estrela-home-20170329-10:51:14 (1/405)
...
Analyzing archive yogi-home-20220509-02:01:01 (405/405)
1 orphaned objects found!
Deleting 1 orphaned and 420 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, problems found.

jdchristensen avatar May 09 '22 15:05 jdchristensen

On the Debian system:

# borg debug get-obj /Backups/borg/system.borg 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd.chunk
object 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd not found.

# ls -l /Backups/borg/system.borg/data/1
total 205308
-rw------- 1 bu bu 210234581 Jun 20  2017 946

On the Ubuntu system:

# borg debug get-obj /Backups/borg/home.borg 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd.chunk
object 8a158ba7fdfae9b1373063a5bb5ea8ea6698c93ed7feff89ca6ff0a3c8842ebd not found.

# ls -l /Backups/borg/home.borg/data/37
total 1453100
-rw------- 2 bu bu 201259844 Dec  8  2020 18596
-rw------- 2 bu bu 185611530 Dec 12  2020 18651
-rw------- 2 bu bu 125106377 Dec 25  2020 18858
-rw------- 2 bu bu 524318301 Dec 26  2020 18874
-rw------- 2 bu bu 193813842 Dec 30  2020 18940
-rw------- 2 bu bu 116657254 Dec 30  2020 18945
-rw------- 2 bu bu 141181725 Dec 31  2020 18953

The repair said that orphans were being deleted, so maybe that is why they don't show up? And maybe the segment files are still there because I haven't run compact?

jdchristensen avatar May 09 '22 15:05 jdchristensen

I get the same "object ... not found" message if I try get-obj on the (unrepaired) copy of the Debian repo.

jdchristensen avatar May 09 '22 15:05 jdchristensen

OK, so the 1 missing index entry was an orphan chunk. That takes quite a bit of severity off from this ticket. :-)

It can't be found in the index before repair, because it is not in the index (maybe it never was added to the index).

It can't be found in the index after repair, because it was determined orphaned (no archive uses this object) and thus removed.

So what remains to be determined is what borg operation creates this orphan chunk.

ThomasWaldmann avatar May 09 '22 16:05 ThomasWaldmann

To determine the contents of the orphan chunk, you could seek to the given offset in the given segment file and look at what's there.
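
For example, a minimal Python sketch (not a borg command, just a quick way to eyeball the raw bytes; path and offset taken from the report above):

def peek(segment_path, offset, length=64):
    """Hex-dump `length` bytes of a segment file starting at `offset`."""
    with open(segment_path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    for i in range(0, len(data), 16):
        row = data[i:i + 16]
        print(f"{offset + i:10d}  " + " ".join(f"{b:02x}" for b in row))

# e.g. peek("/Backups/borg/system.borg/data/1/946", 15871355)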

ThomasWaldmann avatar May 09 '22 16:05 ThomasWaldmann

I tried looking at that offset in the file, but the chunks are compressed, so it wasn't obvious. If there's some way to extract the chunk and decompress it, let me know.

About the cause: would network errors cause orphaned chunks? I do occasionally see backups that don't successfully complete due to a network error. But I thought that further repo operations and/or borg compact would clean up such things.

What is particularly odd is that none of those segment files has a recent date. One of them is from 2017! Maybe a chunk in there was somehow left after a prune operation? Maybe a prune operation had a network issue?

jdchristensen avatar May 09 '22 16:05 jdchristensen

I'm also puzzled that the segment file is still there after the borg repair. Maybe the chunk will get removed after a borg compact? Or maybe these chunks are small enough that borg decides to leave them in place?

jdchristensen avatar May 09 '22 16:05 jdchristensen

If a small not-used-anymore (== "deleted") chunk sits in a segment file with other still-used chunks, it might well stay there. borg computes the ratio of deleted to total size, and if that ratio is below the threshold, borg compact won't compact the segment.

In recent borg, you can also give borg compact a threshold of 0, but that might move a lot of data around for not much space saving.
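
A minimal sketch of that decision rule (illustration only, not borg's actual code; the 10% figure stands in for the default threshold, and the byte counts come from the listings above):

def should_compact(deleted_bytes, total_bytes, threshold_percent=10.0):
    """Compact a segment only if enough of it is reclaimable to justify rewriting it."""
    if total_bytes == 0:
        return False
    return 100.0 * deleted_bytes / total_bytes > threshold_percent

# The ~51-byte orphan entry inside the ~200 MB segment 946 is far below any threshold,
# which is why that segment (with its 2017 mtime) is left alone by borg compact:
print(should_compact(51, 210234581))   # False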

ThomasWaldmann avatar May 09 '22 16:05 ThomasWaldmann

BTW, as long as the chunk is logically deleted, it won't show as orphan.

  • PUT x, DEL x and nothing uses x == logically deleted, not an orphan
  • PUT x and nothing uses x == orphan
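
A minimal Python sketch of that distinction (illustration only, not borg's data structures):

def classify(log, referenced):
    """Replay a PUT/DEL log and classify each chunk id."""
    state = {}
    for op, chunk_id in log:
        state[chunk_id] = "present" if op == "PUT" else "deleted"
    return {
        cid: ("in use" if cid in referenced
              else "logically deleted, not an orphan" if st == "deleted"
              else "orphan")
        for cid, st in state.items()
    }

print(classify([("PUT", "x"), ("DEL", "x")], referenced=set()))  # {'x': 'logically deleted, not an orphan'}
print(classify([("PUT", "x")], referenced=set()))                # {'x': 'orphan'}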

ThomasWaldmann avatar May 09 '22 16:05 ThomasWaldmann

You could try this (it should decrypt and decompress as needed). It dumps ALL objects, so needs quite some space / time.

mkdir myrepo-dump
cd myrepo-dump
borg debug dump-repo-objs --ghost ../myrepo

ThomasWaldmann avatar May 09 '22 17:05 ThomasWaldmann

From the borg check docs:

In repair mode, when all the archives were checked, orphaned chunks are deleted
from the repo. One cause of orphaned chunks are input file related errors (like
read errors) in the archive creation process.

E.g. if it starts backing up file_with_ioerror_in_the_middle:

  • it will start reading and chunking the file, writing some chunks to the repo
  • then it runs into an I/O error, because one block of the source file is not readable
  • this aborts backup of that one file due to the exception, but it does not abort the complete backup
  • some chunks are already written to the repo, but no borg item is written to the metadata stream (borg would do that after all content data from this file is processed)
  • so these chunks might be:
    • orphan chunks if not also used by some other archive (new data, not deduplicated)
    • chunks with currently too high refcount (deduplicated data)

Maybe borg could do some cleanup in an exception handler, but IIRC it does not do that yet.
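
A self-contained toy example of that failure mode (hypothetical names, not borg internals):

class Repo:
    def __init__(self):
        self.chunks = {}                 # chunk id -> data already stored in the repo
    def put(self, data):
        cid = f"id{len(self.chunks)}"
        self.chunks[cid] = data
        return cid

def read_chunks(path):
    yield b"chunk 1"
    yield b"chunk 2"
    raise IOError("unreadable block in the middle of " + path)

repo, items = Repo(), []
try:
    ids = [repo.put(c) for c in read_chunks("file_with_ioerror_in_the_middle")]
    items.append(("file_with_ioerror_in_the_middle", ids))   # never reached
except IOError:
    pass   # borg logs the error and continues with the next file

print(len(repo.chunks))   # 2 -- the chunks are already in the repo ...
print(items)              # [] -- ... but no item references them, so they end up
                          # orphaned (or, if deduplicated, with a too-high refcount)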

ThomasWaldmann avatar May 09 '22 17:05 ThomasWaldmann

None of this explains why there would be orphan chunks from years ago. I run borg check every month, and it never reported any errors until yesterday, so I don't think these orphan chunks could have arisen from I/O errors at the time of backup. Somehow they recently became orphans. Could there be a bug in prune? Or maybe several prune operations got interrupted, all in the same month, even though that never caused an error in the past five years?

jdchristensen avatar May 09 '22 18:05 jdchristensen

Yeah, guess that's right. Would be interesting to look at the plaintext of that chunk.

ThomasWaldmann avatar May 09 '22 18:05 ThomasWaldmann

I don't think it's worth the time/space to extract all chunks. I tried using

dd skip=15871355 count=1000000 if=system-bad.borg/data/1/946 of=chunk bs=1

to extract 1MB starting where the chunk starts (this is on the Debian system), and then using od -x chunk to view the bytes, but I don't see any zlib magic bytes there. A debug option to extract a chunk based on segment and offset would be handy.

Here is the start of the od -x chunk output:

0000000 3e63 51fa 0033 0000 6b00 4e73 88d3 e0e7
0000020 af86 0771 7c84 6d6b 343d 9ca2 e720 39e5
0000040 d7de 321b 6c60 57b8 02bd 0000 302d 332e
0000060 3832 100a 517c ceea 0000 0000 c4d8 9186
0000100 a311 4cb9 0ebb d948 783d 1c02 1a37 fc6f
0000120 040e 54d8 7699 1d58 3cf3 758c 7802 559c
0000140 4b8e 830e 0c30 ef44 35e2 ce41 721f 2e0e
0000160 8110 a8f9 5688 da49 e20d 0dee 5510 caea
0000200 f1e3 db1b 0c3b 793e 3b70 cff4 5c6d 5ba6
0000220 3a05 a06b 5031 4f9c fab9 c098 4081 196e
0000240 c66a 876d 29ca 44e9 8b53 2b56 0da9 2c97
0000260 885b 1c70 7547 ecf1 2f97 c339 7604 4e88
0000300 c769 8d4d d68d 95b2 96d6 055c 9f2f 99fa
0000320 f4ad 15a1 a99c 1220 340d 4b80 4de1 6779
0000340 fb3f a26d df98 4dc9 f1f2 e451 eb75 b21e
0000360 e325 58bc f227 ac67 bb7e 3c09 78be 49cd

jdchristensen avatar May 09 '22 18:05 jdchristensen

It's a bit hard to decipher due to the 16bit formatting with bytes swapped, but:

crc32 = 51fa3e63 
len32 = 00000033
tag8 = 00 (PUT)
id256 = 6b734ed388e7e086af7107847c6b6d3d34a29c20e7e539ded71b32606cb857bd
compression = 0200 (lzma or rather "xz")
data = ... (only 8 bytes)

Hmm, does not decompress, strange.
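
For reference, a small Python sketch that reads one PUT entry straight out of a segment file, following the layout decoded above (little-endian crc32, 32-bit length, 1-byte tag, 32-byte id, then the stored bytes); path and offset are from this repo:

import struct

HEADER = struct.Struct("<IIB32s")   # crc32, length (includes this 41-byte header), tag, id

def read_put_entry(segment_path, offset):
    with open(segment_path, "rb") as f:
        f.seek(offset)
        crc32, length, tag, chunk_id = HEADER.unpack(f.read(HEADER.size))
        payload = f.read(length - HEADER.size)   # here: 0x33 - 41 = 10 bytes
        return crc32, length, tag, chunk_id.hex(), payload

# e.g. read_put_entry("/Backups/borg/system.borg/data/1/946", 15871355)
# payload[:2] would be the 2-byte compression marker, payload[2:] the 8 data bytes.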

ThomasWaldmann avatar May 09 '22 21:05 ThomasWaldmann

It seems it is uncompressed data:

bytes.fromhex(d).decode()
'\x00-0.328\n'

ThomasWaldmann avatar May 09 '22 21:05 ThomasWaldmann

So a null byte, followed by the ascii string "-0.328\n". Strange. I confirmed this by adding 43 to the offset and extracting 8 bytes from the segment file. I don't know what that file could have been.

I should have mentioned that I use --compression auto,zlib, so some data won't be compressed, which happened here since such short data doesn't compress well. So I'm surprised there are bytes indicating compression.

Edit: for the record, the length above, 0x33, is 51 in decimal, and includes the 43-byte header, which is why the data portion is 8 bytes long.

jdchristensen avatar May 09 '22 23:05 jdchristensen

I repaired the third repository, the other one on the Ubuntu system. Similar results.

# borg check -v --repair /Backups/borg/system.borg
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Starting repository check
finished segment check at segment 28246
Starting repository index check
Finished full repository check, no problems found.
Starting archive consistency check...
Analyzing archive estrela-system-20170426-08:55:29 (1/289)
...
Analyzing archive yogi-system-20220509-02:04:03 (289/289)
2 orphaned objects found!
Deleting 2 orphaned and 294 superseded objects...
Finished deleting orphaned/superseded objects.
Writing Manifest.
Committing repo.
Archive consistency check complete, problems found.

jdchristensen avatar May 09 '22 23:05 jdchristensen

A theory how this could happen (see https://github.com/borgbackup/borg/issues/6687#issuecomment-1121333995):

  • borg check is ok
  • the repo loses the DEL somehow
  • the previously logically deleted chunk (which is not required because no archive references it) gets undeleted due to that
  • borg check --repair finds the orphan chunk and deletes it again

Did some reviews, but did not find anything suspicious:

git diff -r 1.1.16 -r 1.2.0 src/borg/repository.py
git diff -r 1.1.16 -r 1.2.0 src/borg/_hashindex.c
git diff -r 1.1.16 -r 1.2.0 src/borg/hashindex.pyx

But one thing came to my mind:

  • borg 1.1.x did not persist shadow_index into the hints file (as there was no need to do that, it always used that information immediately within compact_segments in the same borg invocation)
  • borg 1.2 persists/loads shadow_index to/from hints file.
  • thus, when switching from 1.1 to 1.2, there is no shadow_index to load from the hints file, so it initially defaults to empty.

I have to investigate more whether that could be a problem (though I guess it shouldn't be, since the hints file could also just get lost at some point).
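
Purely as an illustration of that last point (hypothetical dict shape, not borg's actual hints format): a shadow_index key that 1.1.x never wrote silently degrades to an empty mapping when loaded.

def load_shadow_index(hints):
    # borg 1.1.x never wrote this key, so right after an upgrade it falls back to empty
    return hints.get("shadow_index", {})

hints_from_1_1 = {"version": 1, "segments": {}, "compact": {}}   # hypothetical shape
print(load_shadow_index(hints_from_1_1))   # {}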

ThomasWaldmann avatar May 15 '22 23:05 ThomasWaldmann

I think I've run into the same problem, but with Borg 1.1.6, and I didn't upgrade at all. My repo only contains 7 days of archive history, and I run repository and archive checks after each backup, every day, without having seen these problems in the past.

[...]
Remote: checking segment file /home/bak_borg/pve/data/81/81759...
Remote: checking segment file /home/bak_borg/pve/data/81/81760...
Remote: checking segment file /home/bak_borg/pve/data/81/81761...
Remote: checking segment file /home/bak_borg/pve/data/81/81766...
Remote: checking segment file /home/bak_borg/pve/data/81/81767...
Remote: checking segment file /home/bak_borg/pve/data/81/81768...
Remote: checking segment file /home/bak_borg/pve/data/81/81769...
Remote: checking segment file /home/bak_borg/pve/data/81/81770...
Remote: checking segment file /home/bak_borg/pve/data/81/81771...
Remote: checking segment file /home/bak_borg/pve/data/81/81772...
Remote: checking segment file /home/bak_borg/pve/data/81/81773...
Remote: checking segment file /home/bak_borg/pve/data/81/81774...
Remote: checking segment file /home/bak_borg/pve/data/81/81775...
Remote: checking segment file /home/bak_borg/pve/data/81/81776...
Remote: checking segment file /home/bak_borg/pve/data/81/81778...
Remote: checking segment file /home/bak_borg/pve/data/81/81780...
Remote: finished segment check at segment 81780
Remote: Starting repository index check
Remote: Index object count mismatch.
Remote: committed index: 865134 objects
Remote: rebuilt index:   865135 objects
Remote: ID: 9295605fcaf23f8609b0dae0f8e93459a8db5621415e70745cd8f24f39967644 rebuilt index: (38222, 119007523) committed index: <not found>
Remote: Finished full repository check, errors found.
RemoteRepository: 169 B bytes sent, 4.58 MB bytes received, 3 messages sent
terminating with warning status, rc 1
[...]
sudo borgmatic --config ~/.config/borgmatic.d/pve.yaml --verbosity 2 borg check -v --repair
[...]
Remote: complete_xfer: deleting unused segment 81780
Remote: complete_xfer: deleting unused segment 81782
Remote: complete_xfer: deleting unused segment 81784
Remote: compaction freed about 1.07 kB repository space.
Remote: compaction completed.
Finished committing repo.
Archive consistency check complete, problems found.
RemoteRepository: 3.60 MB bytes sent, 258.55 MB bytes received, 1503 messages sent
terminating with success status, rc 0
[...]
sudo borgmatic --config ~/.config/borgmatic.d/pve.yaml --verbosity 2 borg check -v --repair
[...]
Remote: not compacting segment 81761 (maybe freeable: 0.06% [205533 bytes])
Remote: not compacting segment 81766 (maybe freeable: 0.04% [232783 bytes])
Remote: not compacting segment 81776 (maybe freeable: 0.00% [6123 bytes])
Remote: compacting segment 81783 with usage count 0 (maybe freeable: 99.26% [1068 bytes])
Remote: compacting segment 81785 with usage count 0 (maybe freeable: 100.00% [17 bytes])
Remote: compacting segment 81786 with usage count 0 (maybe freeable: 83.67% [41 bytes])
Remote: compacting segment 81788 with usage count 0 (maybe freeable: 52.94% [9 bytes])
Remote: complete_xfer: wrote commit at segment 81789
Remote: complete_xfer: deleting unused segment 81783
Remote: complete_xfer: deleting unused segment 81785
Remote: complete_xfer: deleting unused segment 81786
Remote: complete_xfer: deleting unused segment 81788
Remote: compaction freed about 1.07 kB repository space.
Remote: compaction completed.
Finished committing repo.
Archive consistency check complete, no problems found.
RemoteRepository: 76.11 kB bytes sent, 246.55 MB bytes received, 1422 messages sent
terminating with success status, rc 0

ams-tschoening avatar May 20 '22 11:05 ams-tschoening

@ams-tschoening Maybe these are also orphaned chunks due to errors when processing input files.

You have a lot of debug messages there, but I did not see the real issue in the repair output. Maybe check the part you shortened to see whether there is something interesting.

ThomasWaldmann avatar May 20 '22 13:05 ThomasWaldmann