More Segment entry checksum mismatch errors
Have you checked borgbackup docs, FAQ, and open GitHub issues?
Yes, and I have seen similar issues acknowledged there, but I feel this one may be different as the HW has passed tests.
Is this a BUG / ISSUE report or a QUESTION?
ISSUE
Client: borg 1.4.0
Server: borg 1.4.0
Client: Linux client 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Server: Linux server 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux
Hardware / network configuration, and filesystems used.
Server: ext4
How much data is handled by borg?
~210GB
Full borg commandline that lead to the problem (leave away excludes and passwords)
I use borgmatic to create, prune, compact and check twice daily.
Describe the problem you're observing.
Every few months since upgrading to 1.4.0 I get Segment entry checksum mismatch errors on the remote repo. Repairs make them go away. I accept that this points to a HW issue but:
- The `check` in the same backup run fails, even though the `create`, `prune` and `compact` immediately preceding it complete successfully. However, once in this state no actions can be taken on that repo.
- SMART and memtests have passed multiple runs. The SMART data looks fine and the "error" counts are within thresholds.
- Each time I run `check` (without a repair), I get a different (but possibly growing) set of `Segment entry checksum mismatch` errors (i.e. on different segments).
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
Once the checks fail, they always fail. However, it is not clear when the failures start, only that the create, prune and compact immediately preceding them appear to complete successfully.
Include any warning/errors/backtraces from the system logs
$ sudo borg check -v -p ssh://borg@server/backup/borg
Remote: Starting repository check
Remote: Data integrity error: Segment entry checksum mismatch [segment 6, offset 136653910]
Remote: Data integrity error: Segment entry checksum mismatch [segment 21, offset 343224890]
Remote: Data integrity error: Segment entry checksum mismatch [segment 22, offset 126726781]
Remote: Data integrity error: Segment entry checksum mismatch [segment 28, offset 465106529]
Remote: Data integrity error: Segment entry checksum mismatch [segment 31, offset 471072385]
Remote: Data integrity error: Segment entry checksum mismatch [segment 40, offset 94471655]
Remote: Data integrity error: Segment entry checksum mismatch [segment 78, offset 214925215]
Remote: Data integrity error: Segment entry checksum mismatch [segment 87, offset 57019635]
Remote: Data integrity error: Segment entry checksum mismatch [segment 88, offset 96948053]
Remote: Data integrity error: Segment entry checksum mismatch [segment 112, offset 65653905]
etc
I first reported the issue here (and closed it thinking it was a one off): https://github.com/borgbackup/borg/issues/8230
I have possibly more comprehensive info for this occurrence here: https://projects.torsion.org/borgmatic-collective/borgmatic/issues/920
I have not repaired the repo this time in case it needs to be looked at in this state.
What is the expected behaviour when borg running as a server encounters a hardware issue at the time of a backup (or prune or compact)? Is it possible for those operations to still complete successfully?
SMART output. FWIW this disk does spin down when not in use.
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-26-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.11
Device Model: ST3500320AS
Serial Number:
LU WWN Device Id: 5 000c50 009c05306
Firmware Version: SD1A
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database 7.3/5319
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 1.5 Gb/s
Local Time is: Tue Oct 15 14:59:40 2024 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 625) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 113) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103b) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 116 099 006 Pre-fail Always - 114380031
3 Spin_Up_Time 0x0003 094 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 098 098 020 Old_age Always - 2744
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 16
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 120722964166
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 22285
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 098 037 020 Old_age Always - 2080
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 655373
189 High_Fly_Writes 0x003a 096 096 000 Old_age Always - 4
190 Airflow_Temperature_Cel 0x0022 070 044 045 Old_age Always In_the_past 30 (0 6 38 20 0)
194 Temperature_Celsius 0x0022 030 056 000 Old_age Always - 30 (0 18 0 0 0)
195 Hardware_ECC_Recovered 0x001a 048 022 000 Old_age Always - 114380031
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 22281 -
# 2 Short offline Completed without error 00% 22257 -
# 3 Extended offline Aborted by host 90% 22233 -
# 4 Short offline Completed without error 00% 22206 -
# 5 Short offline Completed without error 00% 22185 -
# 6 Short offline Completed without error 00% 22161 -
# 7 Short offline Completed without error 00% 22137 -
# 8 Short offline Completed without error 00% 22113 -
# 9 Extended offline Completed without error 00% 22099 -
#10 Short offline Completed without error 00% 22090 -
#11 Extended offline Aborted by host 80% 22067 -
#12 Short offline Completed without error 00% 22039 -
#13 Short offline Completed without error 00% 22013 -
#14 Short offline Completed without error 00% 21994 -
#15 Short offline Completed without error 00% 21970 -
#16 Short offline Completed without error 00% 21946 -
#17 Short offline Completed without error 00% 21927 -
#18 Extended offline Aborted by host 80% 21904 -
#19 Short offline Completed without error 00% 21876 -
#20 Short offline Completed without error 00% 21855 -
#21 Short offline Completed without error 00% 21843 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
About the SMART stats:
- It has 16 reallocated sectors, so it has some issues. You should keep an eye on this: if the count stays stable, it can be tolerated, but if it keeps increasing, better get a new disk. I have already RMA'd disks with fewer reallocated sectors.
- There was a good extended offline test at 22099 hours.
- You can avoid an extended offline test being aborted by keeping the disk a bit busy, e.g. by running a command that causes a disk access once a minute.
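For example, something along these lines should keep the disk from spinning down while the test runs (replace /dev/sdX with the actual device the repo lives on; the device name here is just a placeholder):

$ while sleep 60; do sudo dd if=/dev/sdX of=/dev/null bs=512 count=1 iflag=direct status=none; done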
About borg:
- If you use an `ssh://` repo, the segment checksums (crc32) are computed server-side (on the machine running `borg serve`).
- borg serve receives chunk data from the borg client, then computes the crc32, then stores the checksum and the data to disk as a "segment entry".
So if the checksum is incorrect, either the checksum or the data was corrupted after the checksum computation was done:
- when storing the segment entry to disk
- while the segment entry rests on disk (hdd media issues)
- when reading the segment entry from disk
- while the checksum and the data are in memory, before the checksum is checked
So it is very likely either a RAM or disk issue, but theoretically other hw components could also be involved.
So, considering you have multiple error-free runs of memtest86+ (?) and the disk definitely has some issues, I'd exchange the HDD.
It does not show many runtime hours, but OTOH such 500GB SATA disks could be quite old overall (15y?), so maybe the runtime counter has had one or more overflows and it should really show a much higher value?
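To illustrate the principle (a simplified sketch only, not borg's actual on-disk segment format or code):

```python
# Simplified illustration, NOT borg's real segment format: a CRC32 is stored
# next to the payload at write time and re-computed at read time. If they
# disagree, the checksum or the data changed somewhere in between
# (disk, RAM, controller, ...).
import struct
import zlib

def write_entry(fh, data: bytes) -> None:
    crc = zlib.crc32(data) & 0xFFFFFFFF
    fh.write(struct.pack("<II", crc, len(data)))  # checksum + length header
    fh.write(data)

def read_entry(fh) -> bytes:
    crc, size = struct.unpack("<II", fh.read(8))
    data = fh.read(size)
    if (zlib.crc32(data) & 0xFFFFFFFF) != crc:
        raise IOError("Segment entry checksum mismatch")
    return data
```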
I have some alternative disks to try, so I'm happy to do that as an easy first step. I presume there's nothing more to do than just copying over the files? I'll leave this task open for a month or two.
Yes, maybe use `rsync -avH --delete` (be careful) so you can restart in case it gets interrupted. After everything is synced, you could do another rsync run using its checksum-comparison option just to make sure.
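For example (the source and destination paths here are just placeholders for your setup):

$ rsync -avH --delete /backup/borg/ /mnt/newdisk/borg/
$ rsync -avH --delete --checksum /backup/borg/ /mnt/newdisk/borg/

The second run compares file contents by checksum instead of size/mtime, so it catches anything the first pass copied incorrectly.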
BTW, just as a sanity check: it's only the disk that the repo resides on that is relevant for HW testing?
Oh and presumably I should repair-create-repair before moving data over?
Suspect is all hardware between the CPU/RAM running `borg serve` and the disk where the repo data is stored.
Better run the `check --repair` AFTER it is on the new disk.
Is there a reason why the checks list different segments each time they're run?
If you only run borg create, segments should not change, just new segments will be created (with higher numbers).
If you run borg compact though, it will compact from old to new segments, and when doing that I guess it will check the CRC32 when reading the old segments.
Iteration over segments might be in hashindex order for some commands, so do not assume a sorted or stable order.
But these were two non-repairing checks run one after another, with no create, prune or compact in between. Even if the order can differ (although the output always seems to be in ascending order), the segments listed are different, as is the total number of segments listed. I would expect checks to be repeatable?
That would rather indicate that your hw works in a non-reproducible / unstable way.
I transferred the repo to a new disk and attempted to run the repair. I get the following:
$ sudo borg check -v -p --repair ssh://borg@server/backup/borg
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!
Type 'YES' if you understand this and want to continue: YES
Remote: Starting repository check
Remote: Data integrity error: Segment entry checksum mismatch [segment 123, offset 130189715]
Remote: attempting to recover /backup/borg/data/0/123
Remote: Data integrity error: Segment entry checksum mismatch [segment 132, offset 398640498]
Remote: attempting to recover /backup/borg/data/0/132
Data integrity error: Segment entry checksum mismatch [segment 132, offset 356431574]
Traceback (most recent call last):
File "borg/archiver.py", line 5391, in main
File "borg/archiver.py", line 5309, in run
File "borg/archiver.py", line 191, in wrapper
File "borg/archiver.py", line 365, in do_check
File "borg/remote.py", line 491, in do_rpc
File "borg/remote.py", line 727, in call
File "borg/remote.py", line 836, in call_many
File "borg/remote.py", line 772, in handle_error
borg.helpers.errors.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 132, offset 356431574]
Platform: Linux nas 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64
Linux: Unknown Linux
Borg: 1.4.0 Python: CPython 3.11.9 msgpack: 1.0.8 fuse: llfuse 1.5.0 [pyfuse3,llfuse]
PID: 1865711 CWD: /user
sys.argv: ['borg', 'check', '-v', '-p', '--repair', 'ssh://borg@server/backup/borg']
SSH_ORIGINAL_COMMAND: None
It seems a little odd that the repair fails?
Yes. It is also odd that one does not see exactly where it fails.
Can you use a completely different machine, make a new copy of the repo and try the repo check again?
In the meantime, the most recent scheduled backup completed successfully (including prunes, compacts and checks). Very strange!
I'll try the different HW stack next.
Added the hardware issue label. Not 100% sure yet, but that's my best guess.
Another data point: I tried another check (non-repair) post backup, and both the client's and the server's borg processes have hung (0 CPU).
As the already-copied repo set has moved on ("repo1"), I will go back to the repo version on the original disk ("repo0") for the HW test.
"repo0" on completely new HW checks fine (no errors) which points to a HW issue on the original server, although on the bright side it validates the original HDD. I think that's enough evidence.
I'm testing the RAM again and more thoroughly, but in the meantime can you advise on what borg does during check so I can possibly try to reproduce? Perhaps calculating hashes in a loop or something?
The repository part of borg check reads all data in the segment files and does a crc32 check for each entry.
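If you want a standalone way to hammer that same read-and-checksum path, a rough sketch along these lines (not borg's actual check code; the repo path is just whatever your repository directory is) reads all segment files repeatedly and reports any file whose CRC32 changes between passes:

```python
# Rough reproducer sketch (not borg's actual check implementation).
# Reads every file under <repo>/data/ several times, computes a CRC32 per
# file on each pass, and reports files whose checksum changes between passes
# (which should never happen on healthy hardware).
import sys
import zlib
from pathlib import Path

def crc_file(path: Path) -> int:
    crc = 0
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF

def main(repo: str, passes: int = 5) -> None:
    files = sorted(p for p in Path(repo, "data").rglob("*") if p.is_file())
    baseline = {p: crc_file(p) for p in files}
    for n in range(2, passes + 1):
        for p in files:
            if crc_file(p) != baseline[p]:
                print(f"pass {n}: checksum changed for {p}")

if __name__ == "__main__":
    main(sys.argv[1])
```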
I have switched the HDD out. Although I was getting similar issues during testing, I can't tell whether that was due to a bad read off the original drive during the copy or something else. Either way, things seem to have settled down now, so I'll continue with caution and update this issue if the problem recurs.
For now, I'll close it.
Just to round this off: the errors were traced back to the RAM after all, which was especially interesting given that it had passed multiple passes of memtest. The quickest and most reliable way to reproduce (apart from a borg check) was to do an "aggregate" checksum on the repo using the script at https://superuser.com/questions/458326/sha1sum-for-a-directory-of-directories/1316125#1316125.
That said, I found the following pretty amusing (all commands run one after another):
$ sudo sha256sum ./data/8/8952
47fe0ee39112782911a19075d32d4f3b532c2193dcf3249855a9d01e87b4914d ./data/8/8952
$ sudo sha256sum ./data/8/8952
b0f3033153d64ebfe288ab92050921c491a0752321499215dc16af9dc2f1ddc4 ./data/8/8952
$ sudo sha256sum ./data/8/8952
5c727ef68f4b787e9a3251266af100ac3009325fdf9f08994b6856cd8b26e2a4 ./data/8/8952
$ sudo sha256sum ./data/8/8952
2fb7d971b3bf45bc14fb8b7687b42dbb888cd34077a4c6315462efb028708d32 ./data/8/8952
So I guess borg check acts as a HW validator too!
Looks like your RAMs have an identity crisis and rather feel like a RNG. :-)
Just to finally close this off: with the replacement RAM everything is running fine. Many thanks for looking!