Automatically skip self-join
It is unclear to me from the docs of start_join/retry_join whether this value should contain all consul servers that I want in the cluster, or only the other ones (not the current machine).
Looking at a 3-node cluster bootstrapping from an automated deployment, it seems to me that it has to be the latter, because otherwise, with retry_join = [machine1, machine2, machine3]:
- machine1 boots up first, sees only itself, joins itself
- machine2 and 3 boot at the same time a bit later, try to join each other first, which succeeds
Because the behaviour of retry_join is "Takes a list of addresses to attempt joining every retry_interval until at least one join works" rather than "until all joins work", none of the machines ever tries to join any further machines after that.
So due to this race we end up in a situation where we have 2 split clusters {machine1} and {machine2,machine3}, as opposed to the desired {machine1,machine2,machine3}. And it stays that way.
And then we get lots of `agent: failed to sync remote state: No cluster leader`.
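For concreteness, this is the shape of config I mean (machine1..machine3 are placeholders for the real server addresses; the same file is deployed to every machine):

```hcl
# Simplified Consul server config, identical on all three machines.
# machine1..machine3 stand in for the actual addresses.
server     = true
retry_join = ["machine1", "machine2", "machine3"]
```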
Is this analysis correct?
The race is prevented if you use bootstrap_expect as well since then the cluster will wait to bootstrap until it has at least a certain number of candidates.
> The race is prevented if you use bootstrap_expect as well since then the cluster will wait to bootstrap until it has at least a certain number of candidates.
I should have said: I use bootstrap_expect = 3. Should the race really be prevented by using it? The problem seems not to be that the cluster bootstraps too early, but that, because retry_join stops at the first successful join, I never get to the expected 3 nodes in a set.
We have nodes that we shut down and start up every other day, and we noticed a failure to start up when start_join is used in the configuration, so we removed it and stuck to retry_join. A couple of days ago I spun up a new 3-node cluster (using retry_join only) and the nodes refused to cluster up; I ended up having to add start_join manually for the initial bootstrapping and remove it afterwards.
The usage of these configuration options is rather unclear, and it doesn't seem like there's an ideal combination of them for our use case (as I said initially, start_join & retry_join seem to work well unless your leader node shuts down...).
I agree with @nh2; I just ran into a similar issue, and a new leader was only elected after I manually restarted the consul service on all three servers. This breaks our automation setup.
I used bootstrap_expect=3 and landed in the same scenario. The only difference is that I am not using retry_join but retry_join_ec2.
Seems like skipping the self-join would make this more robust and let you still keep a simple, static configuration that has a full list of servers (vs. enumerating only different servers).
@slackpad just trying to digest your sentence "skipping the self-join would make this more robust". What does skipping the self-join imply, and how would it make this more robust? How does it resolve the split cluster/node issue?
@sbatchu0108 I was thinking if we retry-join not attempt to join itself and stop there then this would make things simpler to configure (you can always list all 3 server IPs) and more robust (you will keep trying until you join some other node). It's still not perfect, especially with more than 3 servers, but since retry-join always tries to join the full list, the last server to attempt the join should always tie everyone together.
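In rough pseudocode, the behavior being proposed looks something like this (a sketch with stand-in helpers, not Consul's actual join loop):

```go
package retryjoin

import "time"

// retryJoinOthers sketches the proposed retry_join semantics: "self"
// entries are skipped, and the loop only stops once a join to some
// *other* node succeeds. isSelf and join stand in for the real machinery.
func retryJoinOthers(addrs []string, isSelf func(string) bool,
	join func(string) error, interval time.Duration) {
	for {
		joined := false
		for _, addr := range addrs {
			if isSelf(addr) {
				continue // a self-join "succeeds" without forming a cluster
			}
			if join(addr) == nil {
				joined = true
			}
		}
		if joined {
			return // tied into the cluster via at least one other node
		}
		time.Sleep(interval)
	}
}
```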
> Seems like skipping the self-join would make this more robust and let you still keep a simple, static configuration that has a full list of servers (vs. enumerating only different servers).
Yes, absolutely.
I proposed the documentation change instead of this because it would be asking for less, but if consul can be changed so that it doesn't consider joining itself as a successful bootstrapping operation, that would be much better (people wouldn't have to change their configs, this would be very intuitive, and easier to configure -- imagine all nodes starting from the same machine image in a cloud environment). In that case, I think we can change this issue from an enhancement to a bug.
@slackpad
> I was thinking if we retry-join not attempt to join itself and stop there then this would make things simpler to configure (you can always list all 3 server IPs) and more robust (you will keep trying until you join some other node)
Ahh, got it. Agreed: if Consul can be changed, that would be better for everyone, since configs could be left as they are. This proposed change is much appreciated.
I wanted to check in in 2021 and add that I've encountered this as well. I'd also like to put in a vote for possibly skipping the self-join and instead joining a different node, if there are other hosts present in retry_join.
Hello Consul community members,
We would welcome a PR contributed by the community for this usability enhancement!
If you're interested, please comment here so anyone interested can stay informed.
The approach should ensure the following:
- "Self" entries in
start_joinandretry_joinshould be ignored- Think through how to robustly assess whether an entry is "self"
- Emit an error if
start_joinorretry_joincontain no entries other than "self" (e.g., empty list, list with only "self" entries)
Note: we anticipate that this could require more effort to implement and to test/review than it appears.
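As a starting point for the "robustly assess whether an entry is self" item above, one possible check could be sketched like this (helper and parameter names are hypothetical, not Consul code):

```go
package retryjoin

import "net"

// isSelf reports whether a start_join/retry_join entry resolves to one of
// the agent's own addresses. localIPs would be built from the agent's bind
// and advertise addresses; this is a sketch of the check described above.
func isSelf(entry string, localIPs map[string]bool) bool {
	host, _, err := net.SplitHostPort(entry)
	if err != nil {
		host = entry // entry had no port component
	}
	ips, err := net.LookupHost(host) // hostnames need resolving first
	if err != nil {
		return false // leave unresolvable entries for the join to report
	}
	for _, ip := range ips {
		if localIPs[ip] {
			return true
		}
	}
	return false
}
```

Plain string comparison wouldn't be enough on its own, since entries may be hostnames that only resolve to the local address.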
Hello, I'd be interested in taking this on!
Hi @ADustyOldMuffin,
Just checking in to see how things are going and whether you have any questions!
No, sorry! I started looking into this and didn't get very far while trying to get my environment set up for it. I unfortunately won't have any time to look at this.
Hi all, I looked into this a bit today and this is what I found. It looks like joining the cluster is handled here: https://github.com/hashicorp/consul/blob/676ea58bc4f91ca751764f1db5a53c5a9459b062/agent/consul/client.go#L195
Ultimately, Serf is used to facilitate communication between the nodes in a Consul cluster. However, Serf is using memberlist under the hood to actually manage cluster membership.
Based on the above, I think that this fix could potentially be implemented in memberlist, which would effectively apply to Consul after bumping memberlist in Serf, and then Serf in Consul. Does this seem reasonable, or am I missing something here? I'd be interested in looking into this further if this seems like the right path to go down.
Hey @aaronraff, thank you for looking into this.
I think a less invasive fix would be to exclude the current server address in here.
The local addr can be injected into the `retryJoiner` so we can do the check in `retryJoinAddrs`.
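Very roughly, something like the following (field and method names are illustrative guesses, not the actual definitions):

```go
package agent

// Sketch only: inject the agent's own address into the retryJoiner and
// drop it when expanding the configured addresses.
type retryJoiner struct {
	addrs     []string
	localAddr string // injected by the agent at construction time
}

// retryJoinAddrs returns the configured addresses minus our own. A plain
// equality check is used here for brevity; a more robust "self" test
// (resolving hostnames, etc.) would slot in at the same place.
func (r *retryJoiner) retryJoinAddrs() []string {
	out := make([]string, 0, len(r.addrs))
	for _, a := range r.addrs {
		if a == r.localAddr {
			continue // never hand our own address to the join attempt
		}
		out = append(out, a)
	}
	return out
}
```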
Great idea @dhiaayachi, I'm happy to try this out. I appreciate the code pointers!
Hey @aaronraff
Are you still interested in this fix?
Apologies for the delay here. I should have a PR up sometime later this week!
@aaronraff
No problem, look forward to it! :)
@Amier3 @aaronraff What is the status at the moment? Love to see this implemented.
Noticed the PR got closed due to @aaronraff not having signed the CLA; it would be a shame if this change got lost because of that.
Unfortunately I also stumbled over this. My scenario was 3 servers with:

```hcl
bootstrap_expect = 3
retry_join       = ["loadbalancer_ip"]
```

The load balancer points to all 3 servers. Unfortunately, the load balancer redirected each server to itself, which led to Consul stopping the retry. Would be cool if this could be fixed.