vault-consul-on-kube
Upgrading Consul to 0.8.0 doesn't work
HashiCorp has released Consul 0.8.0, and I can't get the cluster to elect a leader after either upgrading the container in place or creating the setup from scratch.
I've added acl_enforce_version_8: false to the Consul configuration and updated the secret in Kubernetes, because the ACL enforcement has changed and the new version-8 behavior is now opt-out. I've also tried to work out whether a small change to the Consul configuration would be enough to use the new ACL methods, but so far I've failed.
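For reference, this is roughly how that change looks on my side, assuming the Consul config is fed in as a JSON file through a Kubernetes secret (the stringData form spares the manual base64 step); everything except acl_enforce_version_8 is a placeholder rather than this repo's exact config:

```yaml
# Rough sketch only: the Secret/field names here are placeholders, not this
# repo's exact config; the key that matters is acl_enforce_version_8.
apiVersion: v1
kind: Secret
metadata:
  name: consul-config
type: Opaque
stringData:
  config.json: |
    {
      "datacenter": "dc1",
      "acl_enforce_version_8": false
    }
```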
The error is as follows:
2017/04/06 15:01:18 [INFO] agent: (LAN) joined: 0 Err: 1 error(s) occurred:
* Failed to join 100.71.157.17: Member 'consul-2' has conflicting node ID '9af13111-825d-424d-9587-c78fde07af73' with this agent's ID
2017/04/06 15:01:18 [WARN] agent: Join failed: <nil>, retrying in 30s
2017/04/06 15:01:22 [ERR] consul: RPC failed to server 100.71.157.17:8300: rpc error: rpc error: Missing node registration
2017/04/06 15:01:22 [ERR] http: Request PUT /v1/session/create, error: rpc error: rpc error: Missing node registration from=127.0.0.1:36792
2017/04/06 15:01:25 [WARN] memberlist: ignoring alive message for 'consul-2': Member 'consul-2' has conflicting node ID '9af13111-825d-424d-9587-c78fde07af73' with this agent's ID
2017/04/06 15:01:32 [ERR] consul: RPC failed to server 100.70.64.97:8300: rpc error: Missing node registration
2017/04/06 15:01:32 [ERR] http: Request PUT /v1/session/create, error: rpc error: Missing node registration from=127.0.0.1:36792
2017/04/06 15:01:42 [ERR] consul: RPC failed to server 100.71.157.17:8300: rpc error: rpc error: Missing node registration
2017/04/06 15:01:42 [ERR] http: Request PUT /v1/session/create, error: rpc error: rpc error: Missing node registration from=127.0.0.1:36792
Before re-initializing the cluster, I deleted all leftover Consul files from the persistent volumes, so I'm confused about where the conflicting node ID could come from.
Downgrading Consul to 0.7.5 fixes the issue. Any help is appreciated!
In the meantime, I'll keep digging into Consul ACLs and try to find a fix myself, but I'm still fairly inexperienced with Consul.
Thanks so much for the report. We haven't given it a shot yet; please do report back with anything you learn. Thanks!
I recommend reporting that behavior, along with your config, on the Consul mailing list. Also, make sure you didn't end up changing any IP addresses, as Consul goes crazy when you do that. Strange but true.
I note this conversation on the Consul list that sounds like exactly your situation: https://groups.google.com/d/msg/consul-tool/dRA-cEBUnz8/9KGrFS0pDQAJ
@rfay Thanks, that's exactly the problem I'm having. I'll keep an eye on that!
I note that this seems to be addressed in the still-unreleased 0.8.1 (https://github.com/hashicorp/consul/pull/2884):
agent: Node IDs derived from host information are now hashed to prevent things like common server hardware from generating IDs with a common prefix across nodes. [GH-2884]
Previously I was going to recommend experimenting with adding the node-id file per the 0.7.3 node-id introduction.
My experiment would be:
- Create a /var/lib/consul/node-id file containing a unique hash on each of the 3 consul servers (see the sketch after this list).
- Update consul to 0.7.5 and verify correct behavior.
- Update consul to 0.8.0 and verify correct behavior.
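One Kubernetes-flavored way to stage that node-id file, without exec-ing into each pod, would be an initContainer in the consul pod spec that writes an ID into the data volume if one isn't already there. This is only a sketch; the container and volume names are illustrative, not taken from this repo:

```yaml
# Sketch only: write a persistent, unique node-id into the consul data volume
# before the agent starts, so each server keeps a stable ID across restarts.
# "write-node-id", "consul-data", and busybox are illustrative names/images.
initContainers:
  - name: write-node-id
    image: busybox
    command:
      - sh
      - -c
      - |
        # keep any existing ID so it stays stable across restarts
        if [ ! -f /var/lib/consul/node-id ]; then
          cat /proc/sys/kernel/random/uuid > /var/lib/consul/node-id
        fi
    volumeMounts:
      - name: consul-data
        mountPath: /var/lib/consul
```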
Another very quick test on 0.8.0 would be to change the consul args in consul-*.yaml to be unique across consul-1/2/3. Adding -node-id per the Consul docs should work. This also seems to move toward resolving the problem of Consul being tied to IP addresses.
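Concretely, I'm picturing something like this in consul-1.yaml (with a different value in consul-2/3.yaml); the UUID is made up and the surrounding args are abbreviated, so treat it as a sketch rather than a drop-in change:

```yaml
# Sketch: give each consul server an explicit, unique node ID via its args.
# The UUID below is a made-up example; generate a real one per server (e.g. uuidgen).
args:
  - "agent"
  - "-node-id=3e2b5b2a-7f2d-4d1e-9c3a-4b1f0d6e9a21"
  # ...the rest of this server's existing args stay as they are
```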
Node Identifiers were introduced in 0.7.3, then the node-id uniqueness was enforced in 0.8.0:
- Node Identifiers: Consul agents can now be configured with a unique identifier, or they will generate one at startup that will persist across agent restarts. This identifier is designed to represent a node across all time, even if the name or address of the node changes. Identifiers are currently only exposed in node-related endpoints, but they will be used in future versions of Consul to help manage Consul servers and the Raft quorum in a more robust manner, as the quorum is currently tracked via addresses, which can change. [GH-2661]
It looks to me like a very simple technique to solve this problem is -disable-host-node-id; this stops using the host as the source for a unique node ID and instead creates a unique, persistent node ID. This option only becomes available in Consul 0.8.1.
Setting this to true will prevent Consul from using information from the host to generate a deterministic node ID, and will instead generate a random node ID which will be persisted in the data directory. This is useful when running multiple Consul agents on the same host for testing. This defaults to false.
IMO our best path going forward is just to add this with the Consul upgrade when that PR comes in (PR welcome). It just means adding it to the args in consul-1/2/3.yaml.
Unfortunately, since that argument is not supported by the currently deployed consul, we can't add the PR until consul has been upgraded at least a little ways.
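For when we do get there, the change should just be one extra arg per server in consul-1/2/3.yaml, along these lines (container name, image tag, and the other args are illustrative; -disable-host-node-id requires Consul 0.8.1 or later):

```yaml
# Sketch: once the image is on Consul >= 0.8.1, add the flag to each server's args.
# Everything except -disable-host-node-id is illustrative/abbreviated.
containers:
  - name: consul
    image: consul:0.8.1
    args:
      - "agent"
      - "-disable-host-node-id"
      # ...this server's existing args stay unchanged
```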
And... 0.8.1 was released today
@rfay I got StatefulSets working here: https://github.com/kelseyhightower/consul-on-kubernetes/pull/12 but I was unable to get Vault to communicate with it. It was giving me permission errors when the agent in the Vault pod was trying to create a session in the cluster: Request PUT /v1/session/create: Permission Denied. I'm sure it's something to do with how 0.8.x differs from 0.7.x, but I've just run out of time to figure it out.
I may be able to pick it back up this weekend.
Thanks for your work on this! It won't surprise me if the StatefulSets approach is too disruptive for us at this point. Of course your excellent https://github.com/drud/vault-consul-on-kube/pull/18 may be about the same. (I assume you meant to comment on that ticket or https://github.com/drud/vault-consul-on-kube/issues/14?)
@rfay No, I meant to comment here. I may start a repo that is just Vault, using @kelseyhightower's consul setup and the Vault ideas from here.
Once I get it working I'll let you know what I figure out.
Updating label to indicate this is paused until we have a path forward!
@rickmanelius the 0.8 series eventually put in a solution for this problem. Doesn't mean we have the bandwidth to work on it right now.