Cluster becomes unhealthy even though there is a healthy member

Open viggy28 opened this issue 3 years ago • 1 comments

What happened: In a 3-member synchronous replication cluster, the primary and the synchronous replica (SR) failed within a short interval. Stolon elected the SR as the new master (it was in fact already failing to report healthy, but had not yet reached the failInterval). I then disabled synchronous replication, but the cluster still could not promote the healthy asynchronous replica (ASR) to become the new master.

What you expected to happen: If synchronous replication is disabled for a cluster, stolon should elect any healthy member as master.

How to reproduce it (as minimally and precisely as possible):

  1. Start a 3 node cluster with Synchronous Replication enabled
  2. Stop the primary first
  3. Wait for 5 seconds
  4. Stop the Synchronous Replica
  5. Disable Synchronous Replication
  6. The sentinel logs an error that there are no eligible masters.

Anything else we need to know?:

Environment: Darwin

  • Stolon version:
user@localhost ~ % /Users/vigneshravichandran/sourcecontrol/github.com/stolon/bin/stolonctl version
stolonctl version fc23394877ed4cbe986f0579f19525f9776846f7
  • Stolon running environment (if useful to understand the bug): binary on baremetal
  • Others: sentinel logs
2021-10-05T14:45:29.708-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:35.014-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:35.014-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:40.323-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:40.323-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:45.571-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:45.571-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:50.834-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:50.834-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:50.841-0700	INFO	cmd/sentinel.go:995	master db is failed	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:50.842-0700	INFO	cmd/sentinel.go:1006	trying to find a new master to replace failed master
2021-10-05T14:45:50.842-0700	INFO	cmd/sentinel.go:1032	electing db as the new master	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:55.982-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:55.982-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:55.990-0700	INFO	cmd/sentinel.go:995	master db is failed	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:55.990-0700	INFO	cmd/sentinel.go:1006	trying to find a new master to replace failed master
2021-10-05T14:45:55.990-0700	WARN	cmd/sentinel.go:1016	cannot choose synchronous standby since there are no common elements between the latest master reported synchronous standbys and the db spec ones	{"reported": [], "spec": ["f6e8dfba"]}
2021-10-05T14:45:55.991-0700	ERROR	cmd/sentinel.go:1035	no eligible masters
2021-10-05T14:46:01.349-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:46:01.349-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:46:01.357-0700	INFO	cmd/sentinel.go:995	master db is failed	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:46:01.357-0700	INFO	cmd/sentinel.go:1006	trying to find a new master to replace failed master
2021-10-05T14:46:01.357-0700	WARN	cmd/sentinel.go:1016	cannot choose synchronous standby since there are no common elements between the latest master reported synchronous standbys and the db spec ones	{"reported": [], "spec": ["f6e8dfba"]}
2021-10-05T14:46:01.357-0700	ERROR	cmd/sentinel.go:1035	no eligible masters
2021-10-05T14:46:06.617-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:46:06.617-0700	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:46:06.625-0700	INFO	cmd/sentinel.go:995	master db is failed	{"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:46:06.625-0700	INFO	cmd/sentinel.go:1006	trying to find a new master to replace failed master
2021-10-05T14:46:06.625-0700	WARN	cmd/sentinel.go:1016	cannot choose synchronous standby since there are no common elements between the latest master reported synchronous standbys and the db spec ones	{"reported": [], "spec": ["f6e8dfba"]}

Why does the check here look at whether the current master is running with synchronous replication, rather than at what the cluster spec says?
https://github.com/sorintlab/stolon/blob/master/cmd/sentinel/cmd/sentinel.go#L1013
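
To make the question concrete, here is a minimal sketch of the two behaviours. This is not stolon's actual code: `masterState`, `commonElements`, and `eligibleMasters` are hypothetical names and a simplified model I made up for illustration, and the `"asr"` db UID is a placeholder. The first call mimics what the logs above show (failed master reports no sync standbys, so nothing is eligible); the second shows what I expected once the cluster spec has synchronous replication turned off.

```go
package main

import "fmt"

// masterState is a hypothetical, simplified view of the failed master
// as seen by the sentinel. It is NOT stolon's real data structure.
type masterState struct {
	StatusSyncRepl   bool     // master last reported sync replication enabled
	ReportedStandbys []string // sync standbys the master last reported
	SpecStandbys     []string // sync standbys in the master's db spec
}

// commonElements returns the elements of b that also appear in a.
func commonElements(a, b []string) []string {
	seen := make(map[string]bool, len(a))
	for _, s := range a {
		seen[s] = true
	}
	var out []string
	for _, s := range b {
		if seen[s] {
			out = append(out, s)
		}
	}
	return out
}

// eligibleMasters sketches the proposed rule: restrict candidates to the
// common sync standbys only when the cluster spec still wants synchronous
// replication AND the failed master was running with it. Otherwise any
// healthy standby is a candidate (the operator accepts possible data loss).
func eligibleMasters(clusterWantsSyncRepl bool, m masterState, healthy []string) []string {
	if !clusterWantsSyncRepl || !m.StatusSyncRepl {
		return healthy
	}
	syncCandidates := commonElements(m.ReportedStandbys, m.SpecStandbys)
	return commonElements(syncCandidates, healthy)
}

func main() {
	// State taken from the WARN line above: reported=[], spec=["f6e8dfba"].
	failed := masterState{
		StatusSyncRepl:   true,
		ReportedStandbys: nil,
		SpecStandbys:     []string{"f6e8dfba"},
	}
	healthy := []string{"asr"} // placeholder UID for the healthy ASR

	fmt.Println("spec sync repl on: ", eligibleMasters(true, failed, healthy))
	fmt.Println("spec sync repl off:", eligibleMasters(false, failed, healthy))
}
```

Whether relaxing the check like this is safe in every case (for example, when synchronous replication is disabled only briefly) is something I have not analysed.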

I manually updated the keeper clusterdata spec to synchronousReplication: false, which made the sentinel pick the ASR as the new master.

cc @sgotti

viggy28, Oct 05 '21 22:10

@viggy28 That's probably a missing check. Feel free to add a check on the wanted sync repl in the cluster spec and see whether all tests still pass with the change, or whether some behaviors change in a way that requires deeper analysis. Then please open a PR.

sgotti, Oct 06 '21 07:10