edgedb
HA tests broken since we switched to Postgres 15
I looked into it a bit but don't really know anything about any of the components, so I'm timing out and handing it over to @fantix. (@fantix, if you don't have cycles, sync back up with me)
From my debugging, it looks like after we've done one failover and try to restart the original master, its keeper never leaves the "initMode": "resync" state. (I experimented with simply allowing that state, but that seems wrong.)
We get this debug output:
2023-10-20T15:16:33.957-0700 WARN cmd/keeper.go:1987 provided --pg-listen-address "127.0.0.1" is a loopback ip. This will be advertized to the other components and communication will fail if they are on different hosts
2023-10-20T15:16:33.963-0700 ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.57901: connect: no such file or directory"}
2023-10-20T15:16:34.057-0700 INFO cmd/sentinel.go:1151 removing old master db {"db": "424eea6e", "keeper": "pg57901"}
2023-10-20T15:16:34.057-0700 INFO cmd/sentinel.go:1503 added new standby db {"db": "ed7a91c7", "keeper": "pg57901"}
2023-10-20T15:16:34.214-0700 ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.57901: connect: no such file or directory"}
2023-10-20T15:16:34.465-0700 ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.57901: connect: no such file or directory"}
pg_basebackup: initiating base backup, waiting for checkpoint to complete
2023-10-20T15:16:34.568-0700 WARN cmd/sentinel.go:287 received db state for unexpected db uid {"receivedDB": "424eea6e", "db": "ed7a91c7", "keeper": "pg57901"}
pg_basebackup seems to start, and is still running when the test times out.
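For anyone picking this up, here is a rough sketch of a helper for scanning a captured log for the "unexpected db uid" pattern, where the keeper reports one db uid while the sentinel expects another. It only assumes the line shape visible in the output above (timestamp, level, source location, message, optional trailing JSON). The names `parse_line` and `mismatched_db_reports` are made up for illustration and are not part of stolon or our test harness.

```python
import json
import re

# Assumed line shape, taken from the output above:
#   <timestamp> <LEVEL> <file>:<line> <message> [{json fields}]
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<src>\S+:\d+)\s+"
    r"(?P<msg>.*?)\s*(?P<fields>\{.*\})?$"
)


def parse_line(line):
    """Return a dict for a structured log line, or None for anything else
    (e.g. raw pg_basebackup output interleaved with the logs)."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    fields = {}
    if m.group("fields"):
        try:
            fields = json.loads(m.group("fields"))
        except json.JSONDecodeError:
            # Keep the line, but without structured fields.
            fields = {}
    return {
        "ts": m.group("ts"),
        "level": m.group("level"),
        "src": m.group("src"),
        "msg": m.group("msg"),
        "fields": fields,
    }


def mismatched_db_reports(lines):
    """Collect entries whose reported db uid differs from the expected one."""
    out = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        f = entry["fields"]
        if "receivedDB" in f and "db" in f and f["receivedDB"] != f["db"]:
            out.append(entry)
    return out
```

Running `mismatched_db_reports` over the full test log should show whether the keeper keeps reporting the old db uid for the whole duration of the resync, or only once right after the restart.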