AO table `Header checksum does not match. Expected 0x0 and found 0xD49F4AA2 (SQLSTATE 22P04)`

Open cobolbaby opened this issue 4 years ago • 9 comments

Greenplum version or build

  • GP: 5.28.1

OS version and uname -a

  • Docker Container: Centos7
  • Docker Host: Linux mdw 4.15.0-117-generic # 118-Ubuntu SMP Fri Sep 4 20:02:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Actual behavior

When I performed a backup recently, I found the following error:

20201115:23:33:36 gpbackup:gpadmin:mdw:061595-[DEBUG]:-Writing data for table mes.pcbcomponenttrace_1_prt_60 to file
20201115:23:33:43 gpbackup:gpadmin:mdw:061595-[CRITICAL]:-ERROR: Error from segment 15: ERROR:  Header checksum does not match.  Expected 0x0 and found 0xD49F4AA2 (SQLSTATE 22P04)
github.com/greenplum-db/gpbackup/backup.BackupDataForAllTables
        /tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/backup/data.go:167
github.com/greenplum-db/gpbackup/backup.backupData
        /tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/backup/backup.go:321
github.com/greenplum-db/gpbackup/backup.DoBackup
        /tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/backup/backup.go:181
main.main.func1
        /tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/gpbackup.go:23
github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra.(*Command).execute
        /tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra/command.go:766
github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        /tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra/command.go:852
github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra.(*Command).Execute
        /tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/vendor/github.com/spf13/cobra/command.go:800
main.main
        /tmp/build/5f8239f8/go/src/github.com/greenplum-db/gpbackup/gpbackup.go:27
runtime.main
        /usr/local/go/src/runtime/proc.go:198
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:2361

I found a similar problem on the official forum and followed the instructions there, performing the operations below. After they completed, however, the md5 value of the file on the primary segment still differed from the one on the mirror.

# select count(1) from mes.pcbcomponenttrace_1_prt_60

ERROR:  Header checksum does not match.  Expected 0x0 and found 0xD49F4AA2  (seg15 slice1 10.12.0.41:40003 pid=23862)
DETAIL:  
Append-Only storage header kind 0 unknown
Scan of Append-Only Row-Oriented relation 'pcbcomponenttrace_1_prt_60'. Append-Only segment file 'base/189710/34812804.1', block header offset in file = 9439816, bufferCount 2960
SQL state: XX001
[gpadmin@mdw greenplum]$ ssh sdw4 md5sum /disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1
# d680dfa823728175df51e846b450f810
[gpadmin@mdw greenplum]$ ssh sdw3 md5sum /disk4/gpdata/gpsegment/mirror/gpseg15/base/189710/34812804.1
# 8294b4c4a354f9caaaa9edc9e0996eb8
# sdw4
[gpadmin@sdw4 greenplum]$ pg_ctl -D /disk4/gpdata/gpsegment/primary/gpseg15 stop -m fast
# mdw
[gpadmin@mdw greenplum]$ gprecoverseg -a
[gpadmin@mdw greenplum]$ gprecoverseg -r -a
[gpadmin@mdw greenplum]$ ssh sdw4 md5sum /disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1
d680dfa823728175df51e846b450f810  /disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1
[gpadmin@mdw greenplum]$ ssh sdw3 md5sum /disk4/gpdata/gpsegment/mirror/gpseg15/base/189710/34812804.1
8294b4c4a354f9caaaa9edc9e0996eb8  /disk4/gpdata/gpsegment/mirror/gpseg15/base/189710/34812804.1
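
The primary/mirror comparison above can be scripted. The following is a minimal sketch, not a Greenplum utility: `same_checksum` is a hypothetical name introduced here, and it assumes both copies are reachable as local paths. On a real cluster each `md5sum` would instead run over `ssh` on the host that holds the copy, as in the transcript.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether two copies of a segment file are identical.
# Prints "match" (exit 0) or "differ" (exit 1); exits 2 if a file cannot be read.
same_checksum() {
    local a b
    # Reading via stdin keeps the file name out of the md5sum output,
    # so the two results can be compared as plain strings.
    a=$(md5sum < "$1") || return 2
    b=$(md5sum < "$2") || return 2
    if [ "$a" = "$b" ]; then
        echo "match"
    else
        echo "differ"
        return 1
    fi
}
```

For example, `same_checksum primary_copy.1 mirror_copy.1` prints `differ` when the two copies have different contents, which is the situation reported here.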

[Screenshot: 2020-11-16 14-19-08]

cobolbaby avatar Nov 16 '20 06:11 cobolbaby

Duplicate of https://github.com/greenplum-db/gpbackup/issues/447

cobolbaby avatar Nov 16 '20 06:11 cobolbaby

From the documentation you posted:

After the successful data access on the table, a full recovery is needed to take care of any other affected relation.

It seems you ran only incremental recovery with gprecoverseg -a. To run full recovery, you need to add the -F flag (e.g. gprecoverseg -aF). This will essentially recreate your downed primary segment by creating an image of your acting primary segment (the mirror segment you failed over to by manually stopping the primary segment).

jimmyyih avatar Nov 18 '20 08:11 jimmyyih

Got it. However, each node already holds a relatively large volume of data, so a full recovery may cause prolonged I/O load.

I plan to use the following methods for data recovery after suspending the insert operation:

# sdw4
[gpadmin@sdw4 greenplum]$ pg_ctl -D /disk4/gpdata/gpsegment/primary/gpseg15 stop -m fast
# mdw
[gpadmin@mdw greenplum]$ gprecoverseg -a
[gpadmin@mdw greenplum]$ psql# create table mes.pcbcomponenttrace_temp as select * from mes.pcbcomponenttrace_1_prt_60 DISTRIBUTED BY (panelid);
[gpadmin@mdw greenplum]$ psql# truncate  mes.pcbcomponenttrace_1_prt_60;
[gpadmin@mdw greenplum]$ psql# insert into mes.pcbcomponenttrace select * from mes.pcbcomponenttrace_temp;
[gpadmin@mdw greenplum]$ psql# drop table mes.pcbcomponenttrace_temp;
[gpadmin@mdw greenplum]$ gprecoverseg -r -a

cobolbaby avatar Nov 19 '20 06:11 cobolbaby

The above method has been verified.

cobolbaby avatar Nov 20 '20 02:11 cobolbaby

The same problem happened in 6.19.1.

cobolbaby avatar May 07 '22 10:05 cobolbaby

Could you provide more detail about the problem, such as the operations performed on the table (pg_stat_last_operation should be able to provide the details)? After which operation do you see this behavior? Are the primary and mirror files the same? If not, does the mirror file also show the problem? (This can be checked by backing up the primary file, copying the mirror file over it, and running the select.) Did you check whether the primary host hardware is reporting any issues?

ashwinstar avatar May 08 '22 02:05 ashwinstar

I have hit this problem many times recently. The symptom is that queries against the data on the primary segment fail, while the same queries on the mirror segment work.

cobolbaby avatar Jul 12 '22 04:07 cobolbaby

Hi @cobolbaby, I suggest checking the disk of the primary. There might be some bad sectors, which would corrupt data silently. Do you have a chance to replace the disk?
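
One way to look for silent damage of this kind is to inspect the bytes at the block header offset reported in the error (9439816 in `base/189710/34812804.1`). The error text ("Expected 0x0", "header kind 0 unknown") would be consistent with a zero-filled region, but that is an assumption and has not been confirmed in this thread. The sketch below is a hypothetical helper, not a Greenplum tool; the name `range_state` and the 64-byte length used in the usage note are illustrative choices only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Greenplum tool): classify LENGTH bytes starting
# at byte OFFSET in FILE as "zero", "nonzero", or "short" (range runs past EOF).
range_state() {
    local file=$1 offset=$2 length=$3 total nonzero
    # Count the bytes actually available in the requested range.
    total=$(dd if="$file" bs=1 skip="$offset" count="$length" 2>/dev/null | wc -c)
    # Count the bytes that remain after deleting every 0x00 byte.
    nonzero=$(dd if="$file" bs=1 skip="$offset" count="$length" 2>/dev/null | tr -d '\000' | wc -c)
    if [ "$total" -ne "$length" ]; then
        echo "short"
    elif [ "$nonzero" -eq 0 ]; then
        echo "zero"
    else
        echo "nonzero"
    fi
}
```

Run on the primary host, a call such as `range_state /disk4/gpdata/gpsegment/primary/gpseg15/base/189710/34812804.1 9439816 64` would show whether that region of the file is zero-filled. Running the same check against the mirror copy gives a point of comparison.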

lij55 avatar Aug 02 '22 03:08 lij55

The monitoring system has not detected IO errors, so there is no plan to replace the disk for the time being.

cobolbaby avatar Aug 02 '22 05:08 cobolbaby

I assume you are using GP6. You can try manually copying the data directory from the mirror to the primary (back up first!). If you can identify the data files of the table in question, you can copy only those files.

lij55 avatar Aug 16 '22 09:08 lij55

I'd like to close this issue, as there is nothing Greenplum can do when it detects a wrong checksum except raise an error.

lij55 avatar Aug 16 '22 09:08 lij55