[Bug] unstable pg_upgrade failed

Open avamingli opened this issue 2 years ago • 15 comments

Cloudberry Database version

No response

What happened

We have been suffering from this for a long time.

pg_upgrade failed
psql: error: connection to server on socket "/tmp/.s.PGSQL.17432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket?
======================================================================

20231024:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[INFO]:-Starting gpstop with args: -a
20231024:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[INFO]:-Gathering information and validating the environment...
Error: 4:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[ERROR]:-gpstop error: postmaster.pid file does not exist.  is Cloudberry instance already stopped?
/code/gpdb_src/src/bin/pg_upgrade/tmp_check/upgrade/qd /code/gpdb_src/src/bin/pg_upgrade
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok

The target cluster was not shut down cleanly.
Failure, exiting

ERROR: Failure encountered in upgrading qd node
real        0m0.050s
user        0m0.019s
sys        0m0.030s
/code/gpdb_src/src/bin/pg_upgrade /code/gpdb_src/src/bin/pg_upgrade/tmp_check/upgrade/qd /code/gpdb_src/src/bin/pg_upgrade
make[1]: *** [Makefile:78: check] Error 1
make: *** [GNUmakefile:194: installcheck-world-src/bin/pg_upgrade-recurse] Error 2
make: Target 'installcheck-world' not remade because of errors.
20231024:08:35:54:031540 gpstart:ip-10-0-1-232:gpadmin-[INFO]:-CoordinatorStart pg_ctl cmd is env GPSESSID=0000000000 GPERA=01d1134bbbff0ed5_231024083553 $GPHOME/bin/pg_ctl -D /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1 -l /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 600 -o " -p 17432 -c gp_role=dispatch " start
[6642](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6643)20231024:08:45:54:031540 gpstart:ip-10-0-1-232:gpadmin-[CRITICAL]:-Error occurred: non-zero rc: 1
[6643](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6644) Command was: 'env GPSESSID=0000000000 GPERA=01d1134bbbff0ed5_231024083553 $GPHOME/bin/pg_ctl -D /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1 -l /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 600 -o " -p 17432 -c gp_role=dispatch " start'
[6644](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6645)rc=1, stdout='waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
[6645](https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260#step:5:6646)', stderr='pg_ctl: server did not start in time
------------------------

It seems gpstart times out after the binaries are switched from gpdb5 to gpdb6.
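
For anyone triaging the "The target cluster was not shut down cleanly" message in the log above: as far as I know, upstream PostgreSQL's pg_upgrade reads the cluster state reported by pg_controldata and only proceeds when it is `shut down`. Below is a minimal Python sketch for checking that state by hand. It assumes `pg_controldata` is on PATH and that Cloudberry's fork behaves like upstream here; the function names are hypothetical and not part of any Cloudberry tool.

```python
import os
import subprocess
from typing import Optional

# Label printed by pg_controldata in the C locale.
STATE_LABEL = "Database cluster state:"


def parse_cluster_state(controldata_output: str) -> Optional[str]:
    """Return the cluster state string from pg_controldata output, or None."""
    for line in controldata_output.splitlines():
        if line.startswith(STATE_LABEL):
            return line[len(STATE_LABEL):].strip()
    return None


def is_cleanly_shut_down(state: Optional[str]) -> bool:
    # Assumption: mirrors upstream pg_upgrade, which accepts only "shut down".
    return state == "shut down"


def cluster_state(datadir: str) -> Optional[str]:
    """Run pg_controldata on a data directory and extract the state."""
    env = {**os.environ, "LC_ALL": "C"}  # force untranslated labels
    result = subprocess.run(
        ["pg_controldata", datadir],
        capture_output=True, text=True, check=True, env=env,
    )
    return parse_cluster_state(result.stdout)
```

Calling `cluster_state()` on the coordinator data directory right before pg_upgrade runs would show whether the failed gpstart left the cluster in a state other than `shut down`.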

What you think should happen instead

No response

How to reproduce

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260

Operating System

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes, I am willing to submit a PR!

Code of Conduct

avamingli avatar Oct 25 '23 01:10 avamingli

AFAIK, adding a longer timeout will not resolve this issue. The -t 600 comes from gpstart's parameter in CBDB CI; if we change it, all components are affected.

avamingli avatar Oct 25 '23 01:10 avamingli

Update: we have increased CI resources to try to fix it.

avamingli avatar Oct 27 '23 01:10 avamingli

increasing ci resources doesn't help. https://github.com/cloudberrydb/cloudberrydb/actions/runs/6744120555/job/18339332696

Ray-Eldath avatar Nov 06 '23 02:11 Ray-Eldath

Another failure: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6751856892/job/18383806659?pr=284

avamingli avatar Nov 06 '23 02:11 avamingli

> increasing ci resources doesn't help. https://github.com/cloudberrydb/cloudberrydb/actions/runs/6744120555/job/18339332696

db internal log can be downloaded at https://github.com/cloudberrydb/cloudberrydb/suites/17879327237/artifacts/1027090491

Ray-Eldath avatar Nov 06 '23 02:11 Ray-Eldath

The problem Ray-Eldath mentioned is not the same as this one; his problem is discussed below. The standby QE is not ready for connections when the QD sends the request. For the test, maybe we should sleep for a while. A better solution may be for the QD FTS to wait until the standby QE is ready.

2023-11-03 15:06:33.031038 UTC,,,p31860,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",8477,
2023-11-03 15:06:33.034261 UTC,"gpadmin","isolation2test",p32307,th-841484160,"10.0.2.31","47820",2023-11-03 15:06:33 UTC,0,con266,,seg1,,,,,"FATAL","57P03","the database system is not accepting connections","Hot standby mode is disabled.",,,,,,0,,"postmaster.c",2747,
2023-11-03 15:06:33.034283 UTC,,,p31853,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","PostgreSQL 14.4 (Cloudberry Database 1.0.0 build 6744120555) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Nov 3 2023 13:46:54 (with assert checking)",,,,,,,0,,"postmaster.c",3556,
2023-11-03 15:06:33.034293 UTC,,,p31853,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 14.4 (Cloudberry Database 1.0.0 build 6744120555) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Nov 3 2023 13:46:54 (with assert checking)",,,,,,0,,"postmaster.c",3558,
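
To make the "wait until ready instead of sleeping" idea concrete, here is a rough Python sketch of a polling loop with a deadline. It is not how FTS works internally, just an illustration for the test side. The `pg_isready_probe` helper assumes the stock PostgreSQL `pg_isready` utility is on PATH (exit code 0 means the server is accepting connections); whether the CI image ships it is an assumption, and both function names are hypothetical.

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 60.0,
                     interval: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll probe() until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def pg_isready_probe(host: str, port: int) -> Callable[[], bool]:
    """Build a probe that shells out to pg_isready (assumed to be on PATH)."""
    def probe() -> bool:
        result = subprocess.run(
            ["pg_isready", "-h", host, "-p", str(port)],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        # pg_isready exits with 0 only when the server accepts connections.
        return result.returncode == 0
    return probe
```

Something like `wait_until_ready(pg_isready_probe("localhost", port), timeout=30)` before the request is sent would replace a fixed sleep with a bounded wait, so the test only waits as long as the segment actually needs.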

lss602726449 avatar Nov 06 '23 09:11 lss602726449

another failure: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6794967552/job/18472551915?pr=290

Ray-Eldath avatar Nov 08 '23 10:11 Ray-Eldath

pg_upgrade failed once again https://github.com/cloudberrydb/cloudberrydb/actions/runs/6820307820/job/18549229016?pr=294

avamingli avatar Nov 10 '23 04:11 avamingli

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6822019838/job/18553628928

zhangwenchao-123 avatar Nov 10 '23 09:11 zhangwenchao-123

https://github.com/cloudberrydb/cloudberrydb/actions/runs/6858612287/job/18649896679?pr=298

avamingli avatar Nov 14 '23 04:11 avamingli

@smartyhero please try setting MAX_CONNECTIONS to 5 or 10 in workflows/release.yml to control resources.

yjhjstz avatar Nov 17 '23 02:11 yjhjstz

Okay, I'll take care of it

smartyhero avatar Nov 20 '23 02:11 smartyhero

PR has been created: https://github.com/cloudberrydb/cloudberrydb/pull/308

smartyhero avatar Nov 20 '23 03:11 smartyhero

The two unstable tests (this one and #301), both caused by an occupied port on the VM, have not recurred since the VM image was rebuilt yesterday. This is kind of strange, because the only change during that rebuild was adding tmux as a new package...

I'll keep rerunning in #306 to see whether this resurfaces.


...and it failed in no time :-( https://github.com/cloudberrydb/cloudberrydb/actions/runs/6938950626/job/18875704817?pr=306
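
If the occupied-port theory is worth confirming, a pre-flight check in the CI job could log whether the port is already taken before gpstart runs. A minimal Python sketch follows; the function name is hypothetical, it only tests whether a TCP bind on the given address succeeds, and it does not look at the Unix socket file under /tmp.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
        return True
```

For example, printing `port_is_free(17432)` just before the coordinator start in the failing step would show whether something else on the VM is holding the port used in the log above.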

Ray-Eldath avatar Nov 21 '23 03:11 Ray-Eldath

@jiaqizho has found some useful info:

https://github.com/cloudberrydb/cloudberrydb/pull/515#issuecomment-2224388539

avamingli avatar Jul 12 '24 03:07 avamingli

Noting that we have ignored this test case, can we close this issue? @avamingli

gongxun0928 avatar Oct 09 '24 03:10 gongxun0928

> Noting that we have ignored this test case, can we close this issue? @avamingli

Sure.

avamingli avatar Oct 09 '24 04:10 avamingli