borg icon indicating copy to clipboard operation
borg copied to clipboard

More Segment entry checksum mismatch

Open sshaikh opened this issue 1 year ago • 13 comments

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes, and have acknowledged similar issues but feel that this may be different as the HW has passed tests.

Is this a BUG / ISSUE report or a QUESTION?

ISSUE

Client: borg 1.4.0 Server: borg 1.4.0

Client: Linux client 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux Server: Linux server 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

Hardware / network configuration, and filesystems used.

Server: ext4

How much data is handled by borg?

~210GB

Full borg commandline that lead to the problem (leave away excludes and passwords)

I use borgmatic to create, prune, compact and check twice daily.

Describe the problem you're observing.

Every few months since upgrading to 1.4.0 I get Segment entry checksum mismatch errors on the remote repo. Repairs make them go away. I accept that this points to a HW issue but:

  1. The check in the same backup run fails, when the create, prune and compact immediately preceding it complete successfully. However, once in this state no actions can be taken on that repo.
  2. SMART and memtests have passed multiple runs. The SMART data looks fine and the "error" counts are within thresholds.
  3. Each time I run check (without a repair), I get a different (but possibly growing) set of Segment entry checksum mismatch errors (ie on different segments)

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Once the checks fail, then they always fail. However it is not clear when the failures start, only that the create, prune and compact immediately preceding them seem to succeed fully.

Include any warning/errors/backtraces from the system logs

$ sudo borg check -v -p  ssh://borg@server/backup/borg
Remote: Starting repository check
Remote: Data integrity error: Segment entry checksum mismatch [segment 6, offset 136653910]
Remote: Data integrity error: Segment entry checksum mismatch [segment 21, offset 343224890]
Remote: Data integrity error: Segment entry checksum mismatch [segment 22, offset 126726781]
Remote: Data integrity error: Segment entry checksum mismatch [segment 28, offset 465106529]
Remote: Data integrity error: Segment entry checksum mismatch [segment 31, offset 471072385]
Remote: Data integrity error: Segment entry checksum mismatch [segment 40, offset 94471655]
Remote: Data integrity error: Segment entry checksum mismatch [segment 78, offset 214925215]
Remote: Data integrity error: Segment entry checksum mismatch [segment 87, offset 57019635]
Remote: Data integrity error: Segment entry checksum mismatch [segment 88, offset 96948053]
Remote: Data integrity error: Segment entry checksum mismatch [segment 112, offset 65653905]

etc

I first reported the issue here (and closed it thinking it was a one off): https://github.com/borgbackup/borg/issues/8230

I have possibly more comprehensive info for this occurrence here: https://projects.torsion.org/borgmatic-collective/borgmatic/issues/920

I have not repaired the repo this time in case it needs to be looked at in this state.

sshaikh avatar Oct 15 '24 14:10 sshaikh

What is the expectation if borg running as a server encounters a hardware issue at the time of backup (or prune or compact)? Is it possible for them to complete to success?

sshaikh avatar Oct 15 '24 14:10 sshaikh

SMART output. FWIW this disc does spin down when not in use.

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-26-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST3500320AS
Serial Number:    
LU WWN Device Id: 5 000c50 009c05306
Firmware Version: SD1A
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database 7.3/5319
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 1.5 Gb/s
Local Time is:    Tue Oct 15 14:59:40 2024 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.


General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  625) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 113) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       114380031
  3 Spin_Up_Time            0x0003   094   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2744
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       16
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       120722964166
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22285
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   037   020    Old_age   Always       -       2080
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       655373
189 High_Fly_Writes         0x003a   096   096   000    Old_age   Always       -       4
190 Airflow_Temperature_Cel 0x0022   070   044   045    Old_age   Always   In_the_past 30 (0 6 38 20 0)
194 Temperature_Celsius     0x0022   030   056   000    Old_age   Always       -       30 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   048   022   000    Old_age   Always       -       114380031
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     22281         -
# 2  Short offline       Completed without error       00%     22257         -
# 3  Extended offline    Aborted by host               90%     22233         -
# 4  Short offline       Completed without error       00%     22206         -
# 5  Short offline       Completed without error       00%     22185         -
# 6  Short offline       Completed without error       00%     22161         -
# 7  Short offline       Completed without error       00%     22137         -
# 8  Short offline       Completed without error       00%     22113         -
# 9  Extended offline    Completed without error       00%     22099         -
#10  Short offline       Completed without error       00%     22090         -
#11  Extended offline    Aborted by host               80%     22067         -
#12  Short offline       Completed without error       00%     22039         -
#13  Short offline       Completed without error       00%     22013         -
#14  Short offline       Completed without error       00%     21994         -
#15  Short offline       Completed without error       00%     21970         -
#16  Short offline       Completed without error       00%     21946         -
#17  Short offline       Completed without error       00%     21927         -
#18  Extended offline    Aborted by host               80%     21904         -
#19  Short offline       Completed without error       00%     21876         -
#20  Short offline       Completed without error       00%     21855         -
#21  Short offline       Completed without error       00%     21843         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sshaikh avatar Oct 15 '24 14:10 sshaikh

About the SMART stats:

  • it has 16 reallocated sectors. so it has some issues. you should keep an eye on this: if the count stays stable, it could be tolerated, but if the count is increasing, better get a new disk. i have already RMAd disks with fewer reallocated sectors.
  • there was a good extended offline test at 22099 hours.
  • you can avoid that an extended offline test is aborted by keeping the disk a bit busy, e.g. running some command that causes a disk access once a minute.

ThomasWaldmann avatar Oct 15 '24 14:10 ThomasWaldmann

About borg:

  • if you use a ssh: repo, the segment checksums (crc32) are computed server-side (on the machine running borg serve).
  • borg serve receives chunk data from the borg client, then computes the crc32, then stores checksum and data to disk as a "segment entry".

So if the checksum is incorrect, either the checksum or the data is corrupted after the checksum computation was done:

  • when storing the segment entry to disk
  • while the segment entry rests on disk (hdd media issues)
  • when reading the segment entry from disk
  • while the checksum and the data is in memory, before the checksum is checked

So, it is very likely either a RAM or disk issue, but theoretically it could be also other hw components involved.

ThomasWaldmann avatar Oct 15 '24 14:10 ThomasWaldmann

So, considering you have multiple error-free runs of memtest86+ (?) and the disk definitely has some issues, I'ld exchange the HDD.

It does not show many runtime hours, but otoh such 500GB SATA disks could be quite old overall (15y?), so maybe the runtime just had one or more overflows and it should show a much higher value?

ThomasWaldmann avatar Oct 15 '24 14:10 ThomasWaldmann

I have some alternative disks to try so happy to try that as an easy first step. I presume there's nothing more to do than just copying over the files? I'll leave this task open for a month or two.

sshaikh avatar Oct 15 '24 14:10 sshaikh

Yes, maybe use rsync -avH --delete (be careful) so you can restart in case it gets interrupted. After all is synced, you could do another rsync run using their "checksums checking" option just to make sure.

ThomasWaldmann avatar Oct 15 '24 15:10 ThomasWaldmann

BTW just for a sensibility check - it's only the disk that the repo resides on that is relevant for HW testing?

Oh and presumably I should repair-create-repair before moving data over?

sshaikh avatar Oct 15 '24 15:10 sshaikh

Suspect is all hw between the cpu/ram running "borg serve" and the disk where the repo data is stored on.

Better run the check --repair AFTER it is on the new disk.

ThomasWaldmann avatar Oct 15 '24 15:10 ThomasWaldmann

Is there a reason why the checks list different segments each time they're run?

sshaikh avatar Oct 15 '24 15:10 sshaikh

If you only run borg create, segments should not change, just new segments will be created (with higher numbers).

If you run borg compact though, it will compact from old to new segments. But when doing that, guess it will check CRC32 when reading old segments.

Iteration over segments might be in hashindex order for some commands, so do not assume that it is sorted or stable order.

ThomasWaldmann avatar Oct 15 '24 15:10 ThomasWaldmann

But these would be two non repairing checks that are run one after another, with no create, prune or compacts in between. Even if the order can be different (although they always seem to be in ascending order), the segments listed are different, as are the total number of segments listed. I would expect checks to be repeatable?

sshaikh avatar Oct 15 '24 15:10 sshaikh

That would rather indicate that your hw works in a non-reproducable / unstable way.

ThomasWaldmann avatar Oct 15 '24 15:10 ThomasWaldmann

I transferred the repo to a new disk and attempted to run the repair. I get the following:

$ sudo borg check -v -p --repair  ssh://borg@server/backup/borg
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Remote: Starting repository check
Remote: Data integrity error: Segment entry checksum mismatch [segment 123, offset 130189715]
Remote: attempting to recover /backup/borg/data/0/123
Remote: Data integrity error: Segment entry checksum mismatch [segment 132, offset 398640498]
Remote: attempting to recover /backup/borg/data/0/132
Data integrity error: Segment entry checksum mismatch [segment 132, offset 356431574]
Traceback (most recent call last):
  File "borg/archiver.py", line 5391, in main
  File "borg/archiver.py", line 5309, in run
  File "borg/archiver.py", line 191, in wrapper
  File "borg/archiver.py", line 365, in do_check
  File "borg/remote.py", line 491, in do_rpc
  File "borg/remote.py", line 727, in call
  File "borg/remote.py", line 836, in call_many
  File "borg/remote.py", line 772, in handle_error
borg.helpers.errors.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 132, offset 356431574]

Platform: Linux nas 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64
Linux: Unknown Linux
Borg: 1.4.0  Python: CPython 3.11.9 msgpack: 1.0.8 fuse: llfuse 1.5.0 [pyfuse3,llfuse]
PID: 1865711  CWD: /user
sys.argv: ['borg', 'check', '-v', '-p', '--repair', 'ssh://borg@server/backup/borg']
SSH_ORIGINAL_COMMAND: None

This seems a little odd that the repair fails?

sshaikh avatar Oct 20 '24 22:10 sshaikh

Yes. It is also odd that one does not see exactly where it fails.

Can you use a completely different machine, make a new copy of the repo and try the repo check again?

ThomasWaldmann avatar Oct 20 '24 23:10 ThomasWaldmann

In the meantime, the most recent scheduled backup completed successfully (including prunes, compacts and checks). Very strange!

I'll try the different HW stack next.

sshaikh avatar Oct 21 '24 11:10 sshaikh

Added the hardware issue label. Not 100% sure yet, but that's my best guess.

ThomasWaldmann avatar Oct 21 '24 12:10 ThomasWaldmann

Another data point: tried another check (non repair), post backup, and both client and server's borg processes have hung (0 CPU)

As the already copied repo set has moved on ("repo1"), I will go back to the repo version on the original disk (repo0) for the HW test.

sshaikh avatar Oct 21 '24 15:10 sshaikh

"repo0" on completely new HW checks fine (no errors) which points to a HW issue on the original server, although on the bright side it validates the original HDD. I think that's enough evidence.

I'm testing the RAM again and more thoroughly, but in the meantime can you advise on what borg does during check so I can possibly try to reproduce? Perhaps calculating hashes in a loop or something?

sshaikh avatar Oct 21 '24 20:10 sshaikh

The repository part of borg check reads all data in the segment files and does a crc32 check for each entry.

ThomasWaldmann avatar Oct 21 '24 22:10 ThomasWaldmann

I have switched the HDD out and although during testing I had been getting similar issues, I can't tell if that was due to a bad read off of the original drive during the copy or something else - either way things seem to have settled down now so I'll continue with caution and update this issue if the problem arises.

For now, I'll close it.

sshaikh avatar Oct 25 '24 17:10 sshaikh

Just to round this off, the errors were traced back to the RAM after all, which was especially interesting given they passed multiple passes of memtest. The quickest and most reliable way to reproduce (apart from a borg check) was to do an "aggregate" checksum on the repo using the script at https://superuser.com/questions/458326/sha1sum-for-a-directory-of-directories/1316125#1316125.

That said, I found the following pretty amusing (all commands run one after another):

$ sudo sha256sum ./data/8/8952
47fe0ee39112782911a19075d32d4f3b532c2193dcf3249855a9d01e87b4914d  ./data/8/8952
$ sudo sha256sum ./data/8/8952
b0f3033153d64ebfe288ab92050921c491a0752321499215dc16af9dc2f1ddc4  ./data/8/8952
$ sudo sha256sum ./data/8/8952
5c727ef68f4b787e9a3251266af100ac3009325fdf9f08994b6856cd8b26e2a4  ./data/8/8952
$ sudo sha256sum ./data/8/8952
2fb7d971b3bf45bc14fb8b7687b42dbb888cd34077a4c6315462efb028708d32  ./data/8/8952

So I guess borg check acts as a HW validator too!

sshaikh avatar Oct 30 '24 21:10 sshaikh

Looks like your RAMs have an identity crisis and rather feel like a RNG. :-)

ThomasWaldmann avatar Oct 31 '24 01:10 ThomasWaldmann

Just to finally close this off, the replacement ram has everything running fine. Many thanks for looking!

sshaikh avatar Nov 04 '24 12:11 sshaikh