[Bug] gp_replica_check fails frequently due to inconsistencies detected between primary and mirror node
Cloudberry Database version
PostgreSQL 14.4 (Cloudberry Database 1.6.0+dev.36.g9c96b207 build 81670 commit:9c96b207)
What happened
gp_replica_check fails frequently. The output of one failure is shown below. The check retried several times and still failed. Inconsistencies like this could lead to data integrity issues and undermine the reliability of the database system.
(Thread-54) Host: cbdb-release-pipeline-81670-job-950624, Port: 7004, Database: dsp1
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
NOTICE: heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1255" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1255" for relation "pg_proc" mismatch by -56 at blockno 3
NOTICE: succeeded after retrying
NOTICE: heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1259" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1259" for relation "pg_class" mismatch by -104 at blockno 2
NOTICE: heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1259" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1259" for relation "pg_class" mismatch by -104 at blockno 2
NOTICE: heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1259" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1259" for relation "pg_class" mismatch by -104 at blockno 2
WARNING: heap files "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2/base/104331/1259" and "/code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2/base/104331/1259" for relation "pg_class" mismatch at blockno 2, gave up after 3 retries
gp_replica_check
------------------
f
(1 row)
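To triage output like the above across many pipeline runs, it can help to count which relation and block keep mismatching and which ones exhausted their retries. The following is a minimal sketch, not part of gp_replica_check itself: the function name `summarize_mismatches` and the regular expression are assumptions modeled only on the NOTICE/WARNING lines quoted in this report, so the pattern may need adjusting if the message wording differs in other versions.

```python
import re
from collections import Counter

# Pattern modeled on the "heap files ... mismatch" lines quoted in this
# report (assumption: other builds may word these messages differently).
MISMATCH_RE = re.compile(
    r'^(?P<level>NOTICE|WARNING):\s+heap files "(?P<primary>[^"]+)" and '
    r'"(?P<mirror>[^"]+)" for relation "(?P<relation>[^"]+)" mismatch'
    r'(?: by (?P<delta>-?\d+))? at blockno (?P<blockno>\d+)'
    r'(?P<gave_up>, gave up after \d+ retries)?'
)


def summarize_mismatches(log_text):
    """Return (retry_counts, gave_up).

    retry_counts maps (relation, blockno) to the number of mismatch
    NOTICE lines seen; gave_up is the set of (relation, blockno) pairs
    for which a "gave up after N retries" WARNING was logged.
    """
    retry_counts = Counter()
    gave_up = set()
    for raw in log_text.splitlines():
        match = MISMATCH_RE.match(raw.strip())
        if match is None:
            continue  # e.g. "NOTICE: succeeded after retrying"
        key = (match.group("relation"), int(match.group("blockno")))
        if match.group("gave_up"):
            gave_up.add(key)
        else:
            retry_counts[key] += 1
    return retry_counts, gave_up
```

Calling `summarize_mismatches(open("gp_replica_check.txt").read())` on a saved log returns the two collections; any pair that appears in `gave_up` corresponds to a check that ended with result `f`.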
What you think should happen instead
No response
How to reproduce
Run the ICW test suite, which includes this test.
To run gp_replica_check on its own, use the commands below. Note that running it alone will not reproduce the problem unless the ICW tests have been run first.
cd /code/cbdb_src/gpcontrib/gp_replica_check
make installcheck
Operating System
CentOS 8, CentOS 9, UOS, etc. I don't think it is OS-related.
Anything else
No response
Are you willing to submit PR?
- [ ] Yes, I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct.
A complete log of gp_replica_check: gp_replica_check.txt
What does show fsync; return?
@yjhjstz As we discussed, I was unable to reproduce the issue with the default fsync=off setting when I ran it manually. It is acceptable to check again once the PR is merged. I will reopen this issue if the problem recurs.
(Thread-26) Host: cbdb-release-pipeline-83598-job-981711, Port: 7000, Database: contrib_regression
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/standby
gp_replica_check
------------------
t
(1 row)
(Thread-24) Host: cbdb-release-pipeline-83598-job-981711, Port: 7000, Database: gptest
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/standby
gp_replica_check
------------------
t
(1 row)
(Thread-30) Host: cbdb-release-pipeline-83598-job-981711, Port: 7004, Database: dsp1
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
gp_replica_check
------------------
t
(1 row)
(Thread-25) Host: cbdb-release-pipeline-83598-job-981711, Port: 7000, Database: reuse_gptest
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/standby
gp_replica_check
------------------
t
(1 row)
(Thread-28) Host: cbdb-release-pipeline-83598-job-981711, Port: 7004, Database: template1
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
gp_replica_check
------------------
t
(1 row)
(Thread-29) Host: cbdb-release-pipeline-83598-job-981711, Port: 7004, Database: gpadmin
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
gp_replica_check
------------------
t
(1 row)
(Thread-31) Host: cbdb-release-pipeline-83598-job-981711, Port: 7004, Database: dsp2
Primary Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast3/demoDataDir2
Mirror Data Directory Location: /code/cbdb_src/gpAux/gpdemo/datadirs/dbfast_mirror3/demoDataDir2
gp_replica_check
------------------
t
(1 row)
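When a log contains many result blocks like the ones above, a quick way to spot a failing segment is to map each database to its `t`/`f` result. The sketch below is an illustration only: `collect_results` is a hypothetical helper, and it assumes each thread's block is contiguous in the log and follows the layout quoted here (a `(Thread-N) Host: ..., Port: ..., Database: ...` header, later a dashed separator line followed by `t` or `f`).

```python
import re

# Header layout copied from the blocks quoted in this issue (assumption:
# interleaved output from concurrent threads is not handled).
HEADER_RE = re.compile(
    r'^\(Thread-\d+\) Host: (?P<host>[^,]+), '
    r'Port: (?P<port>\d+), Database: (?P<db>\S+)'
)


def collect_results(log_text):
    """Map (port, database) to True for 't' and False for 'f'."""
    results = {}
    current = None  # (port, database) of the block being read
    prev = ""       # previous stripped line, used to find the separator
    for raw in log_text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = (int(header.group("port")), header.group("db"))
        elif (current is not None and line in ("t", "f")
              and prev and set(prev) == {"-"}):
            # The result value sits directly under the dashed separator.
            results[current] = (line == "t")
            current = None
        prev = line
    return results
```

For example, `[key for key, ok in collect_results(text).items() if not ok]` lists the (port, database) pairs whose check returned `f`.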
Hi @yjhjstz Jiang Hua, I have to reopen this issue because we still see the problem after the code was merged.