stolon
Keeper priority
Add a `--priority` keeper option.
The sentinel will promote a keeper with higher priority than the current master when possible. In async mode this is somewhat non-deterministic because we always elect the node with the highest LSN, and under heavy load the prioritized node might never report an LSN higher than its stronger competitors. However, if the nodes' LSNs are equal, the promotion should happen at some point. In sync mode, we can simply elect any of the synchronous standbys.
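The async-mode rule above can be sketched as a small election function. This is a minimal illustration, not stolon's actual code: the `candidate` type, its fields, and `electMaster` are all hypothetical simplifications of what the sentinel does when comparing standbys.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical, simplified stand-in for stolon's internal DB
// state; only the fields relevant to election are modeled here.
type candidate struct {
	id       string
	priority int
	lsn      uint64 // last WAL location reported by the keeper
}

// electMaster sketches the async-mode rule described above: the node with
// the highest LSN always wins, so priority only acts as a tie-breaker and a
// higher-priority node is promoted once its LSN catches up.
func electMaster(cands []candidate) candidate {
	sorted := append([]candidate(nil), cands...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].lsn != sorted[j].lsn {
			return sorted[i].lsn > sorted[j].lsn
		}
		return sorted[i].priority > sorted[j].priority
	})
	return sorted[0]
}

func main() {
	cands := []candidate{
		{id: "a", priority: 0, lsn: 2000}, // current master, default priority
		{id: "b", priority: 1, lsn: 2000}, // prioritized standby, caught up
	}
	fmt.Println(electMaster(cands).id) // prints "b": equal LSNs, higher priority wins
}
```

This also shows why the behavior is non-deterministic under load: if "b" kept lagging even slightly on LSN, it would never be chosen.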
Implements sorintlab#492
While here, make eyeballing sentinel test failures easier.
semaphoreci failed at `config_test.go:541` in one of the configurations, but I don't think this is related to this pull request -- it seems the database just couldn't restart within 10 seconds. Is there a way to restart the test?
Any updates on this PR? Need this feature.
@arssher can you update this PR?
Hi, I will try to look into this in a few days.
Sorry for the long delay. I have updated the PR:
- Addressed comments above.
- Rebased on current master.
- Added a facility to update keeper priority online (a new `stolonctl setkeeperpriority` command). It allows updating the priority without restarting the keeper (and the underlying Postgres instance), which can be used for a controlled failover. My colleague @maksm90 hinted to me that it would be a nice addition...
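The controlled-failover idea above can be illustrated with a tiny model. This is a hedged sketch, not stolon's real types or store logic: `keeper`, `setKeeperPriority`, and `wantsPromotion` are hypothetical names standing in for "update the priority in the store online" and "the sentinel's promotion check".

```go
package main

import "fmt"

// keeper is a hypothetical, minimal model of a keeper's clusterdata entry;
// the field names are illustrative, not stolon's actual types.
type keeper struct {
	id       string
	priority int
}

// setKeeperPriority models what an online priority update does: change a
// single keeper's priority in the store without touching its Postgres
// instance.
func setKeeperPriority(keepers map[string]*keeper, id string, prio int) {
	keepers[id].priority = prio
}

// wantsPromotion models the sentinel check: promote a standby whose
// priority is strictly higher than the current master's.
func wantsPromotion(master, standby *keeper) bool {
	return standby.priority > master.priority
}

func main() {
	keepers := map[string]*keeper{
		"k0": {"k0", 0}, // current master
		"k1": {"k1", 0}, // standby we want to fail over to
	}
	fmt.Println(wantsPromotion(keepers["k0"], keepers["k1"])) // false: equal priorities
	setKeeperPriority(keepers, "k1", 1)                       // controlled-failover trigger
	fmt.Println(wantsPromotion(keepers["k0"], keepers["k1"])) // true: sentinel would promote k1
}
```

The point of the online update is exactly the second call: raising one standby's priority flips the promotion decision without any keeper restart.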
BTW, when generating the docs (`gen_commands_doc.sh`) I noticed that they weren't updated for some recently added keeper options, e.g. `pg-advertise-address`. I have excluded those from this PR as well, as I believe they should go in a separate commit.
Also, does it make sense to restamp all the doc files with the current date, like "Auto generated by spf13/cobra on 30-Jun-2019"? I can revert them, leaving only the ones that have really changed.
(Made a small cleanup just now: used the `DefaultPriority` const and purged the obsolete `NotSpecifiedPriority` const.)
One more minor fix/cleanup: I forgot to add the generated doc for the `setkeeperpriority` command, and used the `DefaultPriority` const in the keeper help.
The last CI build failed on Consul and passed on all other stores. Looking at the logs, I think this PR is again not the cause of the failure. The error was at `ha_test.go:1574` in `TestKeeperRemovalStolonCtl`; `removekeeper` failed with:

```
utils.go:881: executing stolonctl, args: [--cluster-name=fbfeb889-37b5-402a-89c3-e9acdaf0e248 --store-backend=consul --store-endpoints=127.0.0.1:2278 removekeeper 7c99557a]
utils.go:897: [stolonctl]: cannot update cluster data: Unable to complete atomic operation, key modified
ha_test.go:1576: unexpected err: exit status 1
```

So most probably `stolonctl` got interleaved with the sentinel's clusterdata update.
Hm, now two builds have failed. One is exactly like the previous one: `TestKeeperRemovalStolonCtl` at `ha_test.go:1576`. The other is `TestForceFailSyncReplStandbyCluster`, at

```
ha_test.go:1893: expected master "1a1cd8ab" in cluster view
```

which means the sentinel failed to re-elect a master after the forcefail:

```
utils.go:256: [sentinel dc41d2d6]: 2019-07-01T08:16:13.737Z INFO cmd/sentinel.go:1027 master db is failed {"db": "1c4f397a", "keeper": "4a4c481b"}
utils.go:256: [sentinel dc41d2d6]: 2019-07-01T08:16:13.738Z ERROR cmd/sentinel.go:768 no eligible masters
```
It is unclear to me why this happened. Since I didn't touch the `findBestNewMasters` function, most probably it is not me, but still... can we rerun the tests with debug log level enabled to see what's happening inside `findBestNewMasters`?
@sgotti would you please check this PR?
@sgotti Any updates on this?
@rearden-steel this PR requires an update/rebase and we have to understand if this will also cover #696. Are you willing to take care of this PR? @lawrencejones @maksm90
@sgotti, I've rebased this branch. I had to modify one place because of a linter error; now all existing tests pass, and

```
make test
INTEGRATION=1 STOLON_TEST_STORE_BACKEND=etcdv3 ./test
```

works well for me. As for CI, I've found it to be quite unstable these days: first it failed on build / `make test` with zero output; after re-pushing the branch with `--force` without changes, all tests passed; now, I modified only a single comment and `TestProxyListening` failed with

```
TestProxyListening: utils.go:1090: err: context deadline exceeded
TestProxyListening: proxy_test.go:164: error waiting on store up: timeout
TestProxyListening: utils.go:300: stopping etcd e82d334a
```

which seems not to be a matter of this PR and means that `etcd` just failed to start before the timeout fired, doesn't it?
BTW, in #696 you mentioned that it's worth checking the compatibility of this PR with 05b1b0f595f39d5d2059d8f92a0bb012bb579f9d. I had a look at it, and it seems that everything works fine even without changes, since the `findBestNewMaster` added in this PR uses `findBestNewMasters`, which simply skips master candidates with `--can-be-master=false`. That way, a keeper with both `--priority=0` and `--can-be-master=false` wouldn't be promoted.
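The interaction described above can be sketched as a filtering step. This is an illustrative model only: `db` and `eligibleMasters` are hypothetical names, not stolon's actual `findBestNewMasters` implementation.

```go
package main

import "fmt"

// db is a hypothetical, minimal model of a master candidate; only the two
// fields relevant to this discussion are included.
type db struct {
	id          string
	canBeMaster bool
	priority    int
}

// eligibleMasters sketches the filtering step of candidate selection:
// keepers with --can-be-master=false are skipped before priority is even
// considered, which is why --priority=0 combined with
// --can-be-master=false is effectively redundant.
func eligibleMasters(dbs []db) []db {
	var out []db
	for _, d := range dbs {
		if !d.canBeMaster {
			continue // excluded regardless of its priority
		}
		out = append(out, d)
	}
	return out
}

func main() {
	dbs := []db{
		{id: "a", canBeMaster: true, priority: 0},
		{id: "b", canBeMaster: false, priority: 5}, // never promoted, whatever its priority
	}
	fmt.Println(len(eligibleMasters(dbs))) // prints 1: only "a" remains
}
```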
However, it may be a bit misleading for a user, so maybe we should restrict using both `--priority=0` and `--can-be-master=false` simultaneously? What do you think, @sgotti @rearden-steel @arssher?