vault-consul-on-kube
Upgrading Consul to 0.8.0 doesn't work
HashiCorp has released Consul 0.8.0, and I can't get the cluster to elect a leader after either upgrading the container in place or creating the setup from scratch.
I've added acl_enforce_version_8: false to the Consul configuration and updated the secret in Kubernetes, because the ACL enforcement has changed and the new version-8 behavior is now opt-out. I've also tried to work out whether a small change to the Consul configuration would be enough to use the new ACL methods, but so far I've failed.
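For reference, this is roughly how that change looks on my side, assuming the Consul config is fed in as a JSON file through a Kubernetes secret (the stringData form spares the manual base64 step); everything except acl_enforce_version_8 is a placeholder rather than this repo's exact config:

```yaml
# Rough sketch only: the Secret/field names here are placeholders, not this
# repo's exact config; the key that matters is acl_enforce_version_8.
apiVersion: v1
kind: Secret
metadata:
  name: consul-config
type: Opaque
stringData:
  config.json: |
    {
      "datacenter": "dc1",
      "acl_enforce_version_8": false
    }
```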
The error is as follows:
2017/04/06 15:01:18 [INFO] agent: (LAN) joined: 0 Err: 1 error(s) occurred:
* Failed to join 100.71.157.17: Member 'consul-2' has conflicting node ID '9af13111-825d-424d-9587-c78fde07af73' with this agent's ID
2017/04/06 15:01:18 [WARN] agent: Join failed: <nil>, retrying in 30s
2017/04/06 15:01:22 [ERR] consul: RPC failed to server 100.71.157.17:8300: rpc error: rpc error: Missing node registration
2017/04/06 15:01:22 [ERR] http: Request PUT /v1/session/create, error: rpc error: rpc error: Missing node registration from=127.0.0.1:36792
2017/04/06 15:01:25 [WARN] memberlist: ignoring alive message for 'consul-2': Member 'consul-2' has conflicting node ID '9af13111-825d-424d-9587-c78fde07af73' with this agent's ID
2017/04/06 15:01:32 [ERR] consul: RPC failed to server 100.70.64.97:8300: rpc error: Missing node registration
2017/04/06 15:01:32 [ERR] http: Request PUT /v1/session/create, error: rpc error: Missing node registration from=127.0.0.1:36792
2017/04/06 15:01:42 [ERR] consul: RPC failed to server 100.71.157.17:8300: rpc error: rpc error: Missing node registration
2017/04/06 15:01:42 [ERR] http: Request PUT /v1/session/create, error: rpc error: rpc error: Missing node registration from=127.0.0.1:36792
Before re-initializing the cluster, I deleted all leftover Consul files from the persistent volumes, so I'm confused about where the conflicting node ID could come from.
Downgrading Consul to 0.7.5 fixes the issue. Any help is appreciated!
In the meantime, I'll keep digging into Consul ACLs and try to find a fix myself, but I'm still fairly inexperienced with Consul.
Thanks so much for the report. We haven't given it a shot yet; please do report back with anything you learn. Thanks!
I recommend reporting that behavior, along with your config, on the Consul mailing list. Also, make sure you didn't end up changing any IP addresses, as Consul goes crazy when you do that. Strange but true.
I note this conversation on the Consul list that sounds like exactly your situation: https://groups.google.com/d/msg/consul-tool/dRA-cEBUnz8/9KGrFS0pDQAJ
@rfay Thanks, that's exactly the problem I'm having. I'll keep an eye on that!
I note that this seems to be addressed in the still-unreleased 0.8.1 (https://github.com/hashicorp/consul/pull/2884):
agent: Node IDs derived from host information are now hashed to prevent things like common server hardware from generating IDs with a common prefix across nodes. [GH-2884]
Previously I was going to recommend experimenting with adding the node-id file per the 0.7.3 node-id introduction.
My experiment would be:
- Create a /var/lib/consul/node-id file containing a unique hash on each of the 3 consul servers (see the sketch after this list).
- Update consul to 0.7.5 and verify correct behavior.
- Update consul to 0.8.0 and verify correct behavior.
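One Kubernetes-flavored way to stage that node-id file, without exec-ing into each pod, would be an initContainer in the consul pod spec that writes an ID into the data volume if one isn't already there. This is only a sketch; the container and volume names are illustrative, not taken from this repo:

```yaml
# Sketch only: write a persistent, unique node-id into the consul data volume
# before the agent starts, so each server keeps a stable ID across restarts.
# "write-node-id", "consul-data", and busybox are illustrative names/images.
initContainers:
  - name: write-node-id
    image: busybox
    command:
      - sh
      - -c
      - |
        # keep any existing ID so it stays stable across restarts
        if [ ! -f /var/lib/consul/node-id ]; then
          cat /proc/sys/kernel/random/uuid > /var/lib/consul/node-id
        fi
    volumeMounts:
      - name: consul-data
        mountPath: /var/lib/consul
```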
Another very quick test on 0.8.0 would be to change the consul args in consul-*.yaml to be unique across consul-1/2/3. Adding -node-id per the Consul docs should work. This also seems to move toward resolving the problem of Consul being tied to IP addresses.
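Concretely, I'm picturing something like this in consul-1.yaml (with a different value in consul-2/3.yaml); the UUID is made up and the surrounding args are abbreviated, so treat it as a sketch rather than a drop-in change:

```yaml
# Sketch: give each consul server an explicit, unique node ID via its args.
# The UUID below is a made-up example; generate a real one per server (e.g. uuidgen).
args:
  - "agent"
  - "-node-id=3e2b5b2a-7f2d-4d1e-9c3a-4b1f0d6e9a21"
  # ...the rest of this server's existing args stay as they are
```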
Node Identifiers were introduced in 0.7.3, then the node-id uniqueness was enforced in 0.8.0:
- Node Identifiers: Consul agents can now be configured with a unique identifier, or they will generate one at startup that will persist across agent restarts. This identifier is designed to represent a node across all time, even if the name or address of the node changes. Identifiers are currently only exposed in node-related endpoints, but they will be used in future versions of Consul to help manage Consul servers and the Raft quorum in a more robust manner, as the quorum is currently tracked via addresses, which can change. [GH-2661]
It looks to me like a very simple technique to solve this problem is -disable-host-node-id; this stops using the host as the source for a unique node ID and instead creates a unique, persistent node ID. This option only becomes available in Consul 0.8.1.
Setting this to true will prevent Consul from using information from the host to generate a deterministic node ID, and will instead generate a random node ID which will be persisted in the data directory. This is useful when running multiple Consul agents on the same host for testing. This defaults to false.
IMO our best path going forward is just to add this with the Consul upgrade when that PR comes in (PR welcome). It just means adding it to the args in consul-1/2/3.yaml.
Unfortunately, since that argument is not supported by the currently deployed consul, we can't add the PR until consul has been upgraded at least a little ways.
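For when we do get there, the change should just be one extra arg per server in consul-1/2/3.yaml, along these lines (container name, image tag, and the other args are illustrative; -disable-host-node-id requires Consul 0.8.1 or later):

```yaml
# Sketch: once the image is on Consul >= 0.8.1, add the flag to each server's args.
# Everything except -disable-host-node-id is illustrative/abbreviated.
containers:
  - name: consul
    image: consul:0.8.1
    args:
      - "agent"
      - "-disable-host-node-id"
      # ...this server's existing args stay unchanged
```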
And... 0.8.1 was released today
@rfay I got StatefulSets working here: https://github.com/kelseyhightower/consul-on-kubernetes/pull/12 but I was unable to get Vault to communicate with it. It was giving me permission errors when the agent in the Vault pod was trying to create a session in the cluster: Request PUT /v1/session/create: Permission Denied. I'm sure it's something to do with how 0.8.x differs from 0.7.x, but I've just run out of time to figure it out.
I may be able to pick it back up this weekend.
Thanks for your work on this! It won't surprise me if the StatefulSets approach is too disruptive for us at this point. Of course your excellent https://github.com/drud/vault-consul-on-kube/pull/18 may be about the same. (I assume you meant to comment on that ticket or https://github.com/drud/vault-consul-on-kube/issues/14?)
@rfay No, I meant to comment here. I may start a repo that is just Vault, using @kelseyhightower's consul setup and the Vault ideas from here.
Once I get it working I'll let you know what I figure out.
Updating label to indicate this is paused until we have a path forward!
@rickmanelius the 0.8 series eventually put in a solution for this problem. Doesn't mean we have the bandwidth to work on it right now.