[Bug] unstable pg_upgrade failed
Cloudberry Database version
No response
What happened
We have been hitting this for a long time.
pg_upgrade failed:
psql: error: connection to server on socket "/tmp/.s.PGSQL.17432" failed: No such file or directory
Is the server running locally and accepting connections on that socket?
======================================================================

20231024:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[INFO]:-Starting gpstop with args: -a
20231024:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[INFO]:-Gathering information and validating the environment...
Error: 4:08:45:55:017476 gpstop:ip-10-0-1-232:gpadmin-[ERROR]:-gpstop error: postmaster.pid file does not exist. is Cloudberry instance already stopped?
/code/gpdb_src/src/bin/pg_upgrade/tmp_check/upgrade/qd /code/gpdb_src/src/bin/pg_upgrade
Performing Consistency Checks
-----------------------------
Checking cluster versions ok

The target cluster was not shut down cleanly.
Failure, exiting

ERROR: Failure encountered in upgrading qd node
real 0m0.050s
user 0m0.019s
sys 0m0.030s
/code/gpdb_src/src/bin/pg_upgrade /code/gpdb_src/src/bin/pg_upgrade/tmp_check/upgrade/qd /code/gpdb_src/src/bin/pg_upgrade
make[1]: *** [Makefile:78: check] Error 1
make: *** [GNUmakefile:194: installcheck-world-src/bin/pg_upgrade-recurse] Error 2
make: Target 'installcheck-world' not remade because of errors.
20231024:08:35:54:031540 gpstart:ip-10-0-1-232:gpadmin-[INFO]:-CoordinatorStart pg_ctl cmd is env GPSESSID=0000000000 GPERA=01d1134bbbff0ed5_231024083553 $GPHOME/bin/pg_ctl -D /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1 -l /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 600 -o " -p 17432 -c gp_role=dispatch " start
20231024:08:45:54:031540 gpstart:ip-10-0-1-232:gpadmin-[CRITICAL]:-Error occurred: non-zero rc: 1
 Command was: 'env GPSESSID=0000000000 GPERA=01d1134bbbff0ed5_231024083553 $GPHOME/bin/pg_ctl -D /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1 -l /code/gpdb_src/src/bin/pg_upgrade/tmp_check/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 600 -o " -p 17432 -c gp_role=dispatch " start'
rc=1, stdout='waiting for server to start.......... [progress dots trimmed] .......... stopped waiting
', stderr='pg_ctl: server did not start in time
------------------------
It seems gpstart times out after the binaries are switched from gpdb5 to gpdb6.
What you think should happen instead
No response
How to reproduce
https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260
Operating System
https://github.com/cloudberrydb/cloudberrydb/actions/runs/6623396719/job/17990808401?pr=260
Anything else
No response
Are you willing to submit PR?
- [ ] Yes, I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct.
AFAIK, increasing the timeout would not resolve this issue, and the -t 600 comes from gpstart's parameters in the CBDB CI. If we change it, all components are affected.
Update: we have increased CI resources to try to fix it.
increasing ci resources doesn't help. https://github.com/cloudberrydb/cloudberrydb/actions/runs/6744120555/job/18339332696
another failure: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6751856892/job/18383806659?pr=284
db internal log can be downloaded at https://github.com/cloudberrydb/cloudberrydb/suites/17879327237/artifacts/1027090491
The problem Ray-Eldath mentioned is not the same problem as this one. His problem is discussed later. The standby QE is not ready for connections when the QD sends the request. For the test, maybe we should sleep for a while. A better solution may be for the QD's FTS to wait until the standby QE is ready.
2023-11-03 15:06:33.031038 UTC,,,p31860,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",8477,
2023-11-03 15:06:33.034261 UTC,"gpadmin","isolation2test",p32307,th-841484160,"10.0.2.31","47820",2023-11-03 15:06:33 UTC,0,con266,,seg1,,,,,"FATAL","57P03","the database system is not accepting connections","Hot standby mode is disabled.",,,,,,0,,"postmaster.c",2747,
2023-11-03 15:06:33.034283 UTC,,,p31853,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","PostgreSQL 14.4 (Cloudberry Database 1.0.0 build 6744120555) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Nov 3 2023 13:46:54 (with assert checking)",,,,,,,0,,"postmaster.c",3556,
2023-11-03 15:06:33.034293 UTC,,,p31853,th-841484160,,,,0,,,seg1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 14.4 (Cloudberry Database 1.0.0 build 6744120555) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Nov 3 2023 13:46:54 (with assert checking)",,,,,,0,,"postmaster.c",3558,
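The "sleep for a while" idea above could be made a bit more robust by polling for readiness instead of sleeping for a fixed time. The following is only a rough sketch of that idea on the test side, not how FTS or the test harness actually works. The helper names `wait_until_ready` and `pg_isready_check` are hypothetical, and it assumes the standard PostgreSQL `pg_isready` utility is available on the PATH of the CI image.

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     timeout: float = 60.0,
                     interval: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Returns True if the check succeeded in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def pg_isready_check(host: str, port: int) -> Callable[[], bool]:
    """Build a check based on pg_isready (hypothetical helper).

    pg_isready exits with status 0 only when the server is accepting
    connections; any other status is treated as "not ready yet".
    """
    def check() -> bool:
        result = subprocess.run(
            ["pg_isready", "-h", host, "-p", str(port)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return check
```

A test could then call something like `wait_until_ready(pg_isready_check("localhost", port), timeout=30)` before sending the first request to the segment, where `port` is whatever port that segment listens on. This only hides the race in the test; waiting inside FTS would still be the cleaner fix.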
another failure: https://github.com/cloudberrydb/cloudberrydb/actions/runs/6794967552/job/18472551915?pr=290
pg_upgrade failed once again https://github.com/cloudberrydb/cloudberrydb/actions/runs/6820307820/job/18549229016?pr=294
https://github.com/cloudberrydb/cloudberrydb/actions/runs/6822019838/job/18553628928
https://github.com/cloudberrydb/cloudberrydb/actions/runs/6858612287/job/18649896679?pr=298
@smartyhero please try setting MAX_CONNECTIONS = 5 or 10 in workflows/release.yml to control resources.
Okay, I'll take care of it
PR has been created: https://github.com/cloudberrydb/cloudberrydb/pull/308
The two unstable tests (this one and #301), both caused by an occupied port on the VM, have not reoccurred since the VM image was rebuilt yesterday. This is kinda strange, because the only change in that rebuild was adding tmux as a new package...
I'll keep rerunning #306 to see whether this resurfaces.
...and it failed in no time :-( https://github.com/cloudberrydb/cloudberrydb/actions/runs/6938950626/job/18875704817?pr=306
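If the occupied-port theory is worth checking directly, a small pre-flight check could be run on the VM before the test starts. This is a minimal sketch under assumptions: `port_is_free` is a hypothetical helper, it only tests TCP on the loopback address, and it does not look at the Unix socket file under /tmp or at a leftover postmaster.pid.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else already holds the port.
            return False
        return True
```

For example, `port_is_free(17432)` would check the coordinator port that appears in the failing pg_ctl command. A False result just before gpstart would support the occupied-port explanation, while a True result would point elsewhere.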
@jiaqizho has found some useful info:
https://github.com/cloudberrydb/cloudberrydb/pull/515#issuecomment-2224388539
Noting that we have ignored this test case, can we close this issue? @avamingli
Sure.