[Bug] gp_replica_check fails frequently due to inconsistencies detected between primary and mirror node

Open congxuebin opened this issue 1 year ago • 4 comments

Cloudberry Database version

PostgreSQL 14.4 (Cloudberry Database 1.6.0+dev.36.g9c96b207 build 81670 commit:9c96b207)

What happened

gp_replica_check fails frequently. The output of one failure is shown below. The check retried the mismatching block three times and still failed. Such inconsistencies between primary and mirror could lead to data integrity issues and undermine the reliability of the database system.

(Thread-54) Host: cbdb-release-pipeline-81670-job-950624, Port: 7004, Database: dsp1
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
NOTICE:  heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1255" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1255" for relation "pg_proc" mismatch by -56 at blockno 3
NOTICE:  succeeded after retrying
NOTICE:  heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1259" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1259" for relation "pg_class" mismatch by -104 at blockno 2
NOTICE:  heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1259" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1259" for relation "pg_class" mismatch by -104 at blockno 2
NOTICE:  heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1259" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1259" for relation "pg_class" mismatch by -104 at blockno 2
WARNING:  heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1259" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1259" for relation "pg_class" mismatch at blockno 2, gave up after 3 retries
 gp_replica_check 
------------------
 f
(1 row)
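
For triaging logs like the one above, a small script can count how often each relation/block pair mismatched and which pairs exhausted their retries. The following is a minimal sketch in Python and is not part of gp_replica_check: the regular expression is inferred only from the NOTICE/WARNING lines shown above (it handles "heap files" messages and nothing else), and the name summarize_mismatches is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the NOTICE/WARNING lines in this issue only;
# other gp_replica_check versions or file types may word them differently.
MISMATCH = re.compile(
    r'^(?P<level>NOTICE|WARNING):\s+heap files "(?P<primary>[^"]+)" and '
    r'"(?P<mirror>[^"]+)" for relation "(?P<rel>[^"]+)" mismatch'
    r'(?: by (?P<diff>-?\d+))? at blockno (?P<block>\d+)'
    r'(?P<gave_up>, gave up after (?P<retries>\d+) retries)?'
)


def summarize_mismatches(log_text):
    """Hypothetical helper: summarize mismatch lines in gp_replica_check output.

    Returns (attempts, gave_up):
      attempts - Counter of mismatch notices per (relation, block number)
      gave_up  - set of (relation, block number) pairs that exhausted retries
    """
    attempts = Counter()
    gave_up = set()
    for line in log_text.splitlines():
        m = MISMATCH.match(line.strip())
        if not m:
            continue  # headers, "succeeded after retrying", result rows, etc.
        key = (m.group("rel"), int(m.group("block")))
        if m.group("gave_up"):
            gave_up.add(key)
        else:
            attempts[key] += 1
    return attempts, gave_up
```

Applied to the output above, this reports one mismatch notice for pg_proc at block 3 (which then succeeded on retry) and three for pg_class at block 2, with only the pg_class block in the gave-up set.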

What you think should happen instead

No response

How to reproduce

Run the ICW test suite, which includes this test.

To run gp_replica_check alone, use the commands below. Running it on its own will not reproduce the problem, though, unless the ICW tests have been run first.

cd /code/cbdb_src/gpcontrib/gp_replica_check
make installcheck

Operating System

CentOS 8, CentOS 9, UOS, etc. I don't think it is OS-related.

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes, I am willing to submit a PR!

congxuebin avatar Jul 31 '24 03:07 congxuebin

A complete log of gp_replica_check is attached: gp_replica_check.txt

congxuebin avatar Jul 31 '24 03:07 congxuebin

What does "show fsync;" return?

yjhjstz avatar Aug 21 '24 06:08 yjhjstz

@yjhjstz As we discussed, I was unable to reproduce the issue with the default fsync=off setting when I ran it manually. It would be acceptable to check again once the PR is merged. I will reopen the issue if it recurs.

(Thread-26) Host: cbdb-release-pipeline-83598-job-981711, Port: 7000, Database: contrib_regression
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/standby
 gp_replica_check 
------------------
 t
(1 row)


(Thread-24) Host: cbdb-release-pipeline-83598-job-981711, Port: 7000, Database: gptest
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/standby
 gp_replica_check 
------------------
 t
(1 row)


(Thread-30) Host: cbdb-release-pipeline-83598-job-981711, Port: 7004, Database: dsp1
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
 gp_replica_check 
------------------
 t
(1 row)


(Thread-25) Host: cbdb-release-pipeline-83598-job-981711, Port: 7000, Database: reuse_gptest
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/standby
 gp_replica_check 
------------------
 t
(1 row)


(Thread-28) Host: cbdb-release-pipeline-83598-job-981711, Port: 7004, Database: template1
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
 gp_replica_check 
------------------
 t
(1 row)


(Thread-29) Host: cbdb-release-pipeline-83598-job-981711, Port: 7004, Database: gpadmin
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
 gp_replica_check 
------------------
 t
(1 row)


(Thread-31) Host: cbdb-release-pipeline-83598-job-981711, Port: 7004, Database: dsp2
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
 gp_replica_check 
------------------
 t
(1 row)
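
When a run covers many databases, as above, it can help to reduce the output to one pass/fail entry per database. The following is a rough sketch in Python under the assumption that every result block has the shape shown in this thread: a "(Thread-N) Host: ..., Port: ..., Database: ..." header, later a dashed separator, then a single t or f row. The names parse_results and failed_databases are hypothetical and not part of gp_replica_check.

```python
import re

# Header format as it appears in the output pasted in this issue.
HEADER = re.compile(
    r'^\(Thread-\d+\) Host: (?P<host>\S+), Port: (?P<port>\d+), '
    r'Database: (?P<db>\S+)$'
)


def parse_results(log_text):
    """Hypothetical helper: return a list of (host, port, database, passed)."""
    results = []
    current = None        # (host, port, database) of the block being read
    expect_value = False  # True once the dashed separator has been seen
    for raw in log_text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            current = (m.group("host"), int(m.group("port")), m.group("db"))
            expect_value = False
        elif current and line and set(line) == {"-"}:
            expect_value = True
        elif current and expect_value and line in ("t", "f"):
            results.append(current + (line == "t",))
            current, expect_value = None, False
    return results


def failed_databases(results):
    """Return (port, database) pairs whose check returned f."""
    return [(port, db) for _host, port, db, passed in results if not passed]
```

For the run pasted above, failed_databases(parse_results(text)) is an empty list, since every block ends in t.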

congxuebin avatar Aug 21 '24 09:08 congxuebin

Hi @yjhjstz Jiang Hua, I have to reopen this issue because we still see the problem after the code was merged.

congxuebin avatar Sep 30 '24 02:09 congxuebin