HA tests broken since the switch to Postgres 15

Open msullivan opened this issue 2 years ago • 0 comments

I looked into it a bit but don't really know anything about any of the components, so I'm timing out and handing it over to @fantix. (@fantix, if you don't have cycles, sync back up with me)

From my debugging, it looks like after we've done one failover and are trying to restart the original master, it never comes out of the "initMode": "resync" state. (I experimented with just allowing that, and that seems wrong.)

We get this debug output:

2023-10-20T15:16:33.957-0700	WARN	cmd/keeper.go:1987	provided --pg-listen-address "127.0.0.1" is a loopback ip. This will be advertized to the other components and communication will fail if they are on different hosts
2023-10-20T15:16:33.963-0700	ERROR	cmd/keeper.go:720	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.57901: connect: no such file or directory"}
2023-10-20T15:16:34.057-0700	INFO	cmd/sentinel.go:1151	removing old master db	{"db": "424eea6e", "keeper": "pg57901"}
2023-10-20T15:16:34.057-0700	INFO	cmd/sentinel.go:1503	added new standby db	{"db": "ed7a91c7", "keeper": "pg57901"}
2023-10-20T15:16:34.214-0700	ERROR	cmd/keeper.go:720	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.57901: connect: no such file or directory"}
2023-10-20T15:16:34.465-0700	ERROR	cmd/keeper.go:720	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.57901: connect: no such file or directory"}
pg_basebackup: initiating base backup, waiting for checkpoint to complete
2023-10-20T15:16:34.568-0700	WARN	cmd/sentinel.go:287	received db state for unexpected db uid	{"receivedDB": "424eea6e", "db": "ed7a91c7", "keeper": "pg57901"}

pg_basebackup seems to start, and it is still running when the test times out.
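
For anyone picking this up: the last warning in the log shows the keeper reporting state for db "424eea6e" while the sentinel expects "ed7a91c7". A small helper along these lines might make that kind of mismatch easier to spot in longer logs. This is only a sketch. It assumes the tab-separated layout seen in the output above (timestamp, level, source, message, optional JSON fields), and `parse_line` and `stale_db_reports` are hypothetical names, not part of any existing tool.

```python
import json
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    source: str
    message: str
    fields: dict


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one tab-separated log line.

    Returns None for lines that do not match the assumed layout,
    such as the raw pg_basebackup output mixed into the log.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 4 or parts[1] not in {"DEBUG", "INFO", "WARN", "ERROR"}:
        return None
    fields: dict = {}
    if len(parts) > 4:
        try:
            decoded = json.loads(parts[4])
        except json.JSONDecodeError:
            decoded = None
        if isinstance(decoded, dict):
            fields = decoded
    return LogEntry(parts[0], parts[1], parts[2], parts[3], fields)


def stale_db_reports(
    lines: Iterable[str],
) -> Iterator[Tuple[Optional[str], str, Optional[str]]]:
    """Yield (keeper, received_uid, expected_uid) for every entry whose
    "receivedDB" field differs from its "db" field."""
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        received = entry.fields.get("receivedDB")
        expected = entry.fields.get("db")
        if received is not None and received != expected:
            yield entry.fields.get("keeper"), received, expected
```

Running `list(stale_db_reports(open("ha-test.log")))` on a captured log (the file name is a placeholder) would list each keeper that reported an unexpected db uid, together with the uid it sent and the uid that was expected.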

Oct 20 '23 22:10 msullivan