stolon
Keeper priority
Add a `--priority` keeper option.
The sentinel will promote a keeper with higher priority than the current master when possible. In async mode this is somewhat non-deterministic because we always elect the node with the highest LSN, and under heavy load the prioritized node might never report an LSN higher than its stronger competitors. However, if the nodes' LSNs are equal, the promotion should happen at some point. In sync mode, we can simply elect any of the synchronous standbys.
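The async-mode rule above can be sketched as a small election function. This is a minimal illustration, not stolon's actual code: the `candidate` type, its fields, and `electMaster` are all hypothetical simplifications of what the sentinel does when comparing standbys.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical, simplified stand-in for stolon's internal DB
// state; only the fields relevant to election are modeled here.
type candidate struct {
	id       string
	priority int
	lsn      uint64 // last WAL location reported by the keeper
}

// electMaster sketches the async-mode rule described above: the node with
// the highest LSN always wins, so priority only acts as a tie-breaker and a
// higher-priority node is promoted once its LSN catches up.
func electMaster(cands []candidate) candidate {
	sorted := append([]candidate(nil), cands...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].lsn != sorted[j].lsn {
			return sorted[i].lsn > sorted[j].lsn
		}
		return sorted[i].priority > sorted[j].priority
	})
	return sorted[0]
}

func main() {
	cands := []candidate{
		{id: "a", priority: 0, lsn: 2000}, // current master, default priority
		{id: "b", priority: 1, lsn: 2000}, // prioritized standby, caught up
	}
	fmt.Println(electMaster(cands).id) // prints "b": equal LSNs, higher priority wins
}
```

This also shows why the behavior is non-deterministic under load: if "b" kept lagging even slightly on LSN, it would never be chosen.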
Implements sorintlab#492
While here, make eyeballing sentinel test failures easier.
semaphoreci failed at `config_test.go:541` in one of the configurations, but I don't think this is related to this pull request -- it seems the database just couldn't restart within 10 seconds. Is there a way to restart the test?
Any updates on this PR? Need this feature.
@arssher can you update this PR?
Hi, I will try to look into this in a few days.
Sorry for the long delay. I have updated the PR:
- Addressed comments above.
- Rebased on current master.
- Added a facility to update keeper priority online (a new `stolonctl setkeeperpriority` command). It allows updating the priority without restarting the keeper (and the underlying Postgres instance), which can be used for a controlled failover. My colleague @maksm90 hinted to me that it would be a nice addition...
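The controlled-failover idea above can be illustrated with a tiny model. This is a hedged sketch, not stolon's real types or store logic: `keeper`, `setKeeperPriority`, and `wantsPromotion` are hypothetical names standing in for "update the priority in the store online" and "the sentinel's promotion check".

```go
package main

import "fmt"

// keeper is a hypothetical, minimal model of a keeper's clusterdata entry;
// the field names are illustrative, not stolon's actual types.
type keeper struct {
	id       string
	priority int
}

// setKeeperPriority models what an online priority update does: change a
// single keeper's priority in the store without touching its Postgres
// instance.
func setKeeperPriority(keepers map[string]*keeper, id string, prio int) {
	keepers[id].priority = prio
}

// wantsPromotion models the sentinel check: promote a standby whose
// priority is strictly higher than the current master's.
func wantsPromotion(master, standby *keeper) bool {
	return standby.priority > master.priority
}

func main() {
	keepers := map[string]*keeper{
		"k0": {"k0", 0}, // current master
		"k1": {"k1", 0}, // standby we want to fail over to
	}
	fmt.Println(wantsPromotion(keepers["k0"], keepers["k1"])) // false: equal priorities
	setKeeperPriority(keepers, "k1", 1)                       // controlled-failover trigger
	fmt.Println(wantsPromotion(keepers["k0"], keepers["k1"])) // true: sentinel would promote k1
}
```

The point of the online update is exactly the second call: raising one standby's priority flips the promotion decision without any keeper restart.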
BTW, when generating the docs (`gen_commands_doc.sh`) I noticed that they weren't updated for some recently added keeper options, e.g. `pg-advertise-address`. I have excluded those from this PR as well, as I believe they should go in a separate commit.
Also, does it make sense to restamp all the doc files with the current date, like "Auto generated by spf13/cobra on 30-Jun-2019"? I can revert them, leaving only the ones that have really changed.
(Made a small cleanup just now: used the `DefaultPriority` const and purged the obsolete `NotSpecifiedPriority` const.)
One more minor fix/cleanup: I forgot to add the generated doc for the `setkeeperpriority` command, and used the `DefaultPriority` const in the keeper help.
The last CI build failed on Consul and passed on all other stores. Looking at the logs, I think this PR is again not the cause of the failure. The error was at `ha_test.go:1574` in `TestKeeperRemovalStolonCtl`; `removekeeper` failed with:

```
utils.go:881: executing stolonctl, args: [--cluster-name=fbfeb889-37b5-402a-89c3-e9acdaf0e248 --store-backend=consul --store-endpoints=127.0.0.1:2278 removekeeper 7c99557a]
utils.go:897: [stolonctl]: cannot update cluster data: Unable to complete atomic operation, key modified
ha_test.go:1576: unexpected err: exit status 1
```

So most probably `stolonctl` got interleaved with the sentinel's clusterdata update.
Hm, now two builds have failed. One is exactly like the previous one: `TestKeeperRemovalStolonCtl` at `ha_test.go:1576`. The other is `TestForceFailSyncReplStandbyCluster`, at

```
ha_test.go:1893: expected master "1a1cd8ab" in cluster view
```

which means the sentinel failed to re-elect a master after the forcefail:

```
utils.go:256: [sentinel dc41d2d6]: 2019-07-01T08:16:13.737Z INFO cmd/sentinel.go:1027 master db is failed {"db": "1c4f397a", "keeper": "4a4c481b"}
utils.go:256: [sentinel dc41d2d6]: 2019-07-01T08:16:13.738Z ERROR cmd/sentinel.go:768 no eligible masters
```
It is unclear to me why this happened. Since I didn't touch the `findBestNewMasters` function, most probably it is not me, but still... can we rerun the tests with debug log level enabled to see what's happening inside `findBestNewMasters`?
@sgotti would you please check this PR?
@sgotti Any updates on this?
@rearden-steel this PR requires an update/rebase and we have to understand if this will also cover #696. Are you willing to take care of this PR? @lawrencejones @maksm90
@sgotti, I've rebased this branch. I had to modify one place because of a linter error; now all existing tests pass, and

```
make test
INTEGRATION=1 STOLON_TEST_STORE_BACKEND=etcdv3 ./test
```

works well for me. As for CI, I've found it to be quite unstable these days: first it failed on build / `make test` with zero output; after re-pushing the branch with `--force` without changes, all tests passed; now, I modified only a single comment and `TestProxyListening` failed with

```
TestProxyListening: utils.go:1090: err: context deadline exceeded
TestProxyListening: proxy_test.go:164: error waiting on store up: timeout
TestProxyListening: utils.go:300: stopping etcd e82d334a
```

which seems not to be a matter of this PR and means that `etcd` just failed to start before the timeout fired, doesn't it?
BTW, in #696 you mentioned that it's worth checking the compatibility of this PR with 05b1b0f595f39d5d2059d8f92a0bb012bb579f9d. I had a look at it, and it seems that everything works fine even without changes, since the `findBestNewMaster` added in this PR uses `findBestNewMasters`, which simply skips master candidates with `--can-be-master=false`. That way, a keeper with both `--priority=0` and `--can-be-master=false` wouldn't be promoted.
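The interaction described above can be sketched as a filtering step. This is an illustrative model only: `db` and `eligibleMasters` are hypothetical names, not stolon's actual `findBestNewMasters` implementation.

```go
package main

import "fmt"

// db is a hypothetical, minimal model of a master candidate; only the two
// fields relevant to this discussion are included.
type db struct {
	id          string
	canBeMaster bool
	priority    int
}

// eligibleMasters sketches the filtering step of candidate selection:
// keepers with --can-be-master=false are skipped before priority is even
// considered, which is why --priority=0 combined with
// --can-be-master=false is effectively redundant.
func eligibleMasters(dbs []db) []db {
	var out []db
	for _, d := range dbs {
		if !d.canBeMaster {
			continue // excluded regardless of its priority
		}
		out = append(out, d)
	}
	return out
}

func main() {
	dbs := []db{
		{id: "a", canBeMaster: true, priority: 0},
		{id: "b", canBeMaster: false, priority: 5}, // never promoted, whatever its priority
	}
	fmt.Println(len(eligibleMasters(dbs))) // prints 1: only "a" remains
}
```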
However, it may be a bit misleading for a user, so maybe we should restrict using both `--priority=0` and `--can-be-master=false` simultaneously? What do you think, @sgotti @rearden-steel @arssher?