stolon
Cluster becomes unhealthy even though there is a healthy member
What happened:
In a 3-member synchronous replication cluster, the primary and the synchronous replica (SR) failed within a short interval of each other. Stolon elected the SR as the new master (it was in fact already failing to report healthy, but had not yet reached the failInterval). I then disabled synchronous replication, but the cluster still could not promote the healthy asynchronous replica (ASR) to become the new master.
What you expected to happen: If synchronous replication is disabled for a cluster, stolon should elect any healthy member as master.
How to reproduce it (as minimally and precisely as possible):
- Start a 3 node cluster with Synchronous Replication enabled
- Stop the primary first
- Wait for 5 seconds
- Stop the Synchronous Replica
- Disable Synchronous Replication
- The sentinel should now log an error that there are no eligible masters.
Anything else we need to know?:
Environment: Darwin
- Stolon version:
user@localhost ~ % /Users/vigneshravichandran/sourcecontrol/github.com/stolon/bin/stolonctl version
stolonctl version fc23394877ed4cbe986f0579f19525f9776846f7
- Stolon running environment (if useful to understand the bug): binary on baremetal
- Others: sentinel logs
2021-10-05T14:45:29.708-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:35.014-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:35.014-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:40.323-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:40.323-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:45.571-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:45.571-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:50.834-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:50.834-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:50.841-0700 INFO cmd/sentinel.go:995 master db is failed {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:50.842-0700 INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2021-10-05T14:45:50.842-0700 INFO cmd/sentinel.go:1032 electing db as the new master {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:55.982-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:55.982-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:45:55.990-0700 INFO cmd/sentinel.go:995 master db is failed {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:45:55.990-0700 INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2021-10-05T14:45:55.990-0700 WARN cmd/sentinel.go:1016 cannot choose synchronous standby since there are no common elements between the latest master reported synchronous standbys and the db spec ones {"reported": [], "spec": ["f6e8dfba"]}
2021-10-05T14:45:55.991-0700 ERROR cmd/sentinel.go:1035 no eligible masters
2021-10-05T14:46:01.349-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:46:01.349-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:46:01.357-0700 INFO cmd/sentinel.go:995 master db is failed {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:46:01.357-0700 INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2021-10-05T14:46:01.357-0700 WARN cmd/sentinel.go:1016 cannot choose synchronous standby since there are no common elements between the latest master reported synchronous standbys and the db spec ones {"reported": [], "spec": ["f6e8dfba"]}
2021-10-05T14:46:01.357-0700 ERROR cmd/sentinel.go:1035 no eligible masters
2021-10-05T14:46:06.617-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:46:06.617-0700 WARN cmd/sentinel.go:276 no keeper info available {"db": "f6e8dfba", "keeper": "postgres2"}
2021-10-05T14:46:06.625-0700 INFO cmd/sentinel.go:995 master db is failed {"db": "edf43366", "keeper": "postgres1"}
2021-10-05T14:46:06.625-0700 INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2021-10-05T14:46:06.625-0700 WARN cmd/sentinel.go:1016 cannot choose synchronous standby since there are no common elements between the latest master reported synchronous standbys and the db spec ones {"reported": [], "spec": ["f6e8dfba"]}
I am wondering why the check here looks at whether the current master has synchronous replication enabled, instead of at what the cluster spec says.
https://github.com/sorintlab/stolon/blob/master/cmd/sentinel/cmd/sentinel.go#L1013
I manually updated the keeper clusterdata spec to synchronousReplication: false, which made the sentinel pick the ASR as the new master.
cc @sgotti
@viggy28 That's probably a missing check. Feel free to add a check on the wanted sync repl setting in the cluster spec, and see whether all tests still pass with the change or whether some behaviors change in a way that requires a deeper analysis. Then please open a PR.
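To make the suggested check concrete, here is a minimal, self-contained sketch of the election decision. It is not stolon's actual code: the `DB` type, its fields, and `chooseNewMaster` are hypothetical names invented for illustration, and the real logic in cmd/sentinel/cmd/sentinel.go may be structured differently. The only idea it demonstrates is that candidates are restricted to the synchronous standbys only when the cluster spec still wants synchronous replication.

```go
package main

// DB is a hypothetical, trimmed-down stand-in for a db entry in the
// cluster data. None of these field names are taken from stolon.
type DB struct {
	UID                  string
	Healthy              bool
	SpecSyncRepl         bool     // sync repl flag as stored on the db spec
	SpecSyncStandbys     []string // sync standbys wanted by the db spec
	ReportedSyncStandbys []string // sync standbys last reported by the master
}

// commonElements returns the UIDs that appear in both slices.
func commonElements(a, b []string) []string {
	seen := make(map[string]bool, len(a))
	for _, uid := range a {
		seen[uid] = true
	}
	var common []string
	for _, uid := range b {
		if seen[uid] {
			common = append(common, uid)
		}
	}
	return common
}

// chooseNewMaster returns the UID of the first healthy standby that may
// replace failedMaster. wantSyncRepl is the cluster-spec level setting.
// Candidates are limited to synchronous standbys only if the cluster spec
// still asks for synchronous replication AND the failed master was
// configured with it; otherwise any healthy standby is eligible.
func chooseNewMaster(wantSyncRepl bool, failedMaster DB, standbys []DB) (string, bool) {
	restrict := wantSyncRepl && failedMaster.SpecSyncRepl
	allowed := make(map[string]bool)
	if restrict {
		for _, uid := range commonElements(failedMaster.ReportedSyncStandbys, failedMaster.SpecSyncStandbys) {
			allowed[uid] = true
		}
	}
	for _, db := range standbys {
		if !db.Healthy {
			continue
		}
		if restrict && !allowed[db.UID] {
			continue
		}
		return db.UID, true
	}
	return "", false
}
```

With the state from the logs above (failed master edf43366 with spec standbys ["f6e8dfba"] and reported standbys []), this sketch finds no eligible master while `wantSyncRepl` is true, and picks the healthy asynchronous replica once it is false. Whether the db-level flag should also be consulted is a design choice that would need to be checked against the existing tests.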