AO table `Header checksum does not match. Expected 0x0 and found 0xD49F4AA2 (SQLSTATE 22P04)`
Greenplum version or build
- GP: 5.28.1
OS version and uname -a
- Docker Container: CentOS 7
- Docker Host: Linux mdw 4.15.0-117-generic #118-Ubuntu SMP Fri Sep 4 20:02:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Actual behavior
While performing a backup recently, I encountered the following error:
20201115:23:33:36 gpbackup:gpadmin:mdw:061595-[DEBUG]:-Writing data for table mes.pcbcomponenttrace_1_prt_60 to file
20201115:23:33:43 gpbackup:gpadmin:mdw:061595-[CRITICAL]:-ERROR: Error from segment 15: ERROR: Header checksum does not match. Expected 0x0 and found 0xD49F4AA2 (SQLSTATE 22P04)
github.com/greenplum-db/gpbackup/backup.BackupDataForAllTables
/tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/backup/data.go:167
github.com/greenplum-db/gpbackup/backup.backupData
/tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/backup/backup.go:321
github.com/greenplum-db/gpbackup/backup.DoBackup
/tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/backup/backup.go:181
main.main.func1
/tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/gpbackup.go:23
github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra.(*Command).execute
/tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra/command.go:766
github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra.(*Command).ExecuteC
/tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra/command.go:852
github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra.(*Command).Execute
/tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra/command.go:800
main.main
/tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/gpbackup.go:27
runtime.main
/usr/local/go/src/runtime/proc.go:198
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2361
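As a stopgap while investigating, gpbackup's table filtering can skip the damaged partition so the rest of the backup still completes. A minimal sketch; the database name is a placeholder, and whether a single leaf partition can be excluded this way depends on the gpbackup version:
# hypothetical invocation; replace <dbname> with the actual database
[gpadmin@mdw greenplum]$ gpbackup --dbname <dbname> --exclude-table mes.pcbcomponenttrace_1_prt_60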
I found a similar problem reported on the official forum and followed the suggested steps. However, after completing them, the md5 checksum of the file on the primary segment still differs from the one on the mirror.
# select count(1) from mes.pcbcomponenttrace_1_prt_60
ERROR: Header checksum does not match. Expected 0x0 and found 0xD49F4AA2 (seg15 slice1 10.12.0.41:40003 pid=23862)
DETAIL:
Append-Only storage header kind 0 unknown
Scan of Append-Only Row-Oriented relation 'pcbcomponenttrace_1_prt_60'. Append-Only segment file 'base/189710/34812804.1', block header offset in file = 9439816, bufferCount 2960
SQL state: XX001
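For reference, the path in the DETAIL line can be tied back to the relation through the catalogs: 189710 in `base/189710` is the database OID and 34812804 is the relfilenode. A minimal sketch using gp_dist_random() to read pg_class on every segment:
# run in the affected database
select gp_segment_id, relname from gp_dist_random('pg_class') where relfilenode = 34812804;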
[gpadmin@mdw greenplum]$ ssh sdw4 md5sum /disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1
# d680dfa823728175df51e846b450f810
[gpadmin@mdw greenplum]$ ssh sdw3 md5sum /disk4/gpdata/gpsegment/mirror/gpseg15/base/189710/34812804.1
# 8294b4c4a354f9caaaa9edc9e0996eb8
# sdw4
[gpadmin@sdw4 greenplum]$ pg_ctl -D /disk4/gpdata/gpsegment/primary/gpseg15 stop -m fast
# mdw
[gpadmin@mdw greenplum]$ gprecoverseg -a
[gpadmin@mdw greenplum]$ gprecoverseg -r -a
[gpadmin@mdw greenplum]$ ssh sdw4 md5sum /disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1
d680dfa823728175df51e846b450f810 /disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1
[gpadmin@mdw greenplum]$ ssh sdw3 md5sum /disk4/gpdata/gpsegment/mirror/gpseg15/base/189710/34812804.1
8294b4c4a354f9caaaa9edc9e0996eb8 /disk4/gpdata/gpsegment/mirror/gpseg15/base/189710/34812804.1
Duplicate of https://github.com/greenplum-db/gpbackup/issues/447
From the documentation you posted:
After the successful data access on the table, a full recovery is needed to take care of any other affected relation.
It seems you ran only incremental recovery with `gprecoverseg -a`. To run full recovery, you need to add the -F flag (e.g. `gprecoverseg -aF`). This will essentially recreate your downed primary segment by creating an image of your acting primary segment (the mirror segment you failed over to by manually stopping the primary segment).
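Put together, the suggested sequence looks like this (a sketch using the hosts and paths from this thread; -r rebalances segments back to their preferred roles once recovery has synchronized):
# sdw4: fail over to the mirror by stopping the bad primary
[gpadmin@sdw4 greenplum]$ pg_ctl -D /disk4/gpdata/gpsegment/primary/gpseg15 stop -m fast
# mdw: full recovery recreates the downed primary from the acting primary (the former mirror)
[gpadmin@mdw greenplum]$ gprecoverseg -aF
# mdw: rebalance segments to their preferred roles
[gpadmin@mdw greenplum]$ gprecoverseg -r -a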
Got it. However, because the data volume on each node is already quite large, a full recovery could cause prolonged I/O pressure.
Instead, after suspending insert operations, I plan to recover the data with the following steps:
# sdw4
[gpadmin@sdw4 greenplum]$ pg_ctl -D /disk4/gpdata/gpsegment/primary/gpseg15 stop -m fast
# mdw
[gpadmin@mdw greenplum]$ gprecoverseg -a
[gpadmin@mdw greenplum]$ psql# create table mes.pcbcomponenttrace_temp as select * from mes.pcbcomponenttrace_1_prt_60 DISTRIBUTED BY (panelid);
[gpadmin@mdw greenplum]$ psql# truncate mes.pcbcomponenttrace_1_prt_60;
[gpadmin@mdw greenplum]$ psql# insert into mes.pcbcomponenttrace select * from mes.pcbcomponenttrace_temp;
[gpadmin@mdw greenplum]$ psql# drop table mes.pcbcomponenttrace_temp;
[gpadmin@mdw greenplum]$ gprecoverseg -r -a
The above method has been verified.
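One caveat worth noting (my assumption, not stated above): the TRUNCATE assigns the partition a new relfilenode, so the old `base/189710/34812804.1` path no longer belongs to it. Before re-comparing md5sums on primary and mirror, look up the new file name:
# run in the affected database; the relfilenode is the file name under base/<database-oid>/
select relfilenode from pg_class where relname = 'pcbcomponenttrace_1_prt_60';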
The same problem happened in 6.19.1.
Could you please provide more detail on the problem, such as the operations performed on the table (`pg_stat_last_operation` should be able to provide the details)? After what operation are you seeing this behavior? Are the primary and mirror files the same? If not, does the mirror file also exhibit the problem (this can be checked by backing up the primary file, copying the mirror file over it, and running the select)? Did you check whether the primary host hardware is reporting any issues?
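For reference, a minimal sketch of the suggested `pg_stat_last_operation` check (objid joins to pg_class.oid):
select c.relname, o.staactionname, o.stasubtype, o.statime
from pg_stat_last_operation o
join pg_class c on c.oid = o.objid
where c.relname = 'pcbcomponenttrace_1_prt_60'
order by o.statime;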
I have faced this problem many times recently; the symptom is that querying the data on the primary segment fails, while the same query against the mirror segment works.
Hi @cobolbaby, I suggest checking the disk on the primary; there might be some bad sectors, so data can get corrupted silently. Do you have a chance to replace the disk?
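For reference, a couple of quick manual checks on the segment host (a sketch; assumes smartmontools is installed, and the device name is a placeholder):
[gpadmin@sdw4 greenplum]$ dmesg | grep -i 'i/o error'
[gpadmin@sdw4 greenplum]$ sudo smartctl -H /dev/sda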
The monitoring system has not detected any I/O errors, so there is no plan to replace the disk for the time being.
I assume you are using GP6. You can try to manually copy the data directory from the mirror to the primary (back up first!). You can copy only a subset of the files if you can identify the data files of the table in question.
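A minimal sketch of that manual copy, using the hosts and paths from this thread; it assumes the affected segment (or the whole cluster) is stopped first:
# back up the primary's copy of the file before overwriting it
[gpadmin@mdw greenplum]$ ssh sdw4 cp /disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1 /tmp/34812804.1.bak
# copy the mirror's file over the primary's (scp -3 routes the transfer through mdw)
[gpadmin@mdw greenplum]$ scp -3 sdw3:/disk4/gpdata/gpsegment/mirror/gpseg15/base/189710/34812804.1 sdw4:/disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1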
I'd like to close this issue, as there is nothing Greenplum can do when it detects a wrong checksum except raise an error.