
Unstable deployment using Helm chart

Open dnlopes opened this issue 2 years ago

Hello,

I have been trying out multiple deployment modes for Consul. I successfully deployed a multi-DC setup via WAN federation using autoscaling groups in AWS, and now I'm moving on to deploying on top of K8s. The experience was pretty smooth with ASGs; in the end I got a stable setup where I could add/remove nodes at will and the datacenters reacted smoothly (e.g., consensus was impeccable).

However, I keep running into stability issues with Raft consensus on top of K8s:

Questions

  1. Increasing from 3 replicas to 5 replicas makes Raft lose consensus for some reason. I don't understand this: why would Raft lose consensus when increasing the number of replicas?
  2. With a 5-node deployment, changing a Consul setting (e.g., log rotation) and then running a helm upgrade once again causes consensus to be lost.

One of the issues I believe I detected is that, because pods are named consul-server-1, consul-server-2, etc., when pods are recycled they come up with the same name as the previous one (instead of, for example, generating a random suffix). This causes some members of the consensus protocol to see two nodes with the same name (e.g., consul-server-2) running on different IPs: one from the new pod, and one from the old pod that was just replaced. Because this happens very quickly, the other nodes don’t have time to “forget” the old pod, causing a naming conflict.

Other more general questions:

  1. How can I automate the server upgrade process without downtime? The official documentation mentions that I should manipulate the server.partitions setting in multiple phases (see the sketch below this list). However, in a “real” scenario in which deploys are managed by CI/CD tools, does that mean I need multiple commits and multiple deploys to ensure all the servers receive the upgrade? That sounds a bit unproductive. Are there any alternatives to this while still using the official Helm chart?
  2. The Helm chart is using deprecated settings, both on the Consul side and on the K8s side (e.g., TLS settings and PodSecurityPolicy). Is this a known issue?
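
For reference, here is a hedged sketch of the phased flow I understood from the docs, assuming the chart value involved is server.updatePartition (each phase is a separate helm upgrade):

server:
  replicas: 5
  # Phase 1: only the highest-ordinal pod (consul-server-4) is replaced,
  # because pods with ordinal >= the partition get the new revision.
  updatePartition: 4

# After consul-server-4 is healthy and has rejoined, lower the partition and
# run `helm upgrade` again: updatePartition: 3, then 2, 1, and finally 0,
# one upgrade per phase.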

I’m probably missing something that could explain these issues, as I’m fairly new to working with K8s.

Helm Configuration

global:
  enabled: true
  logLevel: "debug"
  logJSON: false
  name: "dlo"
  datacenter: "dlo"
  consulAPITimeout: "5s"
  enablePodSecurityPolicies: true
  recursors: []
  tls:
    enabled: true
    enableAutoEncrypt: true
    serverAdditionalDNSSANs: []
    serverAdditionalIPSANs: []
    verify: true
    httpsOnly: true
    caCert:
      secretName: null
      secretKey: null
    caKey:
      secretName: null
      secretKey: null
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: null
      secretKey: null
    createReplicationToken: true
    replicationToken:
      secretName: null
      secretKey: null
  gossipEncryption:
    autoGenerate: true
  federation:
    enabled: false
    createFederationSecret: false
    primaryDatacenter: null
    primaryGateways: []
    k8sAuthMethodHost: null
  metrics:
    enabled: false
    enableAgentMetrics: false
    agentMetricsRetentionTime: "1m"
    enableGatewayMetrics: true

server:
  replicas: 5
  #affinity: null # for minikube, set null
  connect: true # setup root CA and certificates
  extraConfig: |
    {
      "log_level": "DEBUG",
      "log_file": "/consul/",
      "log_rotate_duration": "24h",
      "log_rotate_max_files": 7
    }

client:
  enabled: false
  affinity: null
  updateStrategy: |
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  extraConfig: |
    {
      "log_level": "DEBUG"
    }

ui:
  enabled: true
  service:
    enabled: true
    type: LoadBalancer
    port:
      http: 80
      https: 443
  metrics:
    enabled: false
  ingress:
    enabled: false

dns:
  enabled: false

externalServers:
  enabled: false

syncCatalog:
  enabled: false

connectInject:
  enabled: false

controller:
  enabled: false

meshGateway:
  enabled: false

ingressGateways:
  enabled: false

terminatingGateways:
  enabled: false

apiGateway:
  enabled: false

webhookCertManager:
  tolerations: null

prometheus:
  enabled: false

(I know most of the values there are the defaults, but I wanted a YAML with the full config so I could tweak it incrementally.)

Steps to reproduce this issue

  1. helm install with 3 replicas and wait for healthy nodes
  2. change config to 5 replicas and upgrade helm installation
  3. consensus is lost and nodes take a long time (> 5 minutes) to reach consensus

Current understanding and Expected behavior

  1. When adding nodes, consensus should not be lost
  2. When changing node configurations, pod replacement should be done carefully in order to keep consensus and avoid re-elections.

Environment details

I have tested this setup both in minikube and in AWS EKS, both with the same outcomes.

dnlopes avatar Sep 19 '22 14:09 dnlopes

In order to address this issue:

One of the issues I believe I detected is that, because pods are named consul-server-1, consul-server-2, etc., when pods are recycled they come up with the same name as the previous one (instead of, for example, generating a random suffix). This causes some members of the consensus protocol to see two nodes with the same name (e.g., consul-server-2) running on different IPs: one from the new pod, and one from the old pod that was just replaced. Because this happens very quickly, the other nodes don’t have time to “forget” the old pod, causing a naming conflict.

I will try adding the leave_on_terminate: true setting to server.extraConfig to see if this helps.
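
A minimal sketch of what I plan to try, merging the setting into the existing server.extraConfig from my values above:

server:
  extraConfig: |
    {
      "leave_on_terminate": true,
      "log_level": "DEBUG",
      "log_file": "/consul/",
      "log_rotate_duration": "24h",
      "log_rotate_max_files": 7
    }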

I will report back once I have tested this.

dnlopes avatar Sep 21 '22 14:09 dnlopes

So, the leave_on_terminate: true setting does seem to help when recycling pods. Server nodes now shut down gracefully and inform the whole cluster, so when a pod is recreated with the same name no conflicts arise. However, this won't be the case if a pod is killed abruptly, due to a hardware failure for instance, so I'm not sure how stable this setup is under random failures like that.

dnlopes avatar Sep 21 '22 16:09 dnlopes

Since consul-server is a StatefulSet that uses external volumes (AWS EBS, for example), if a VM fails and a K8s node plus the pods running on it are lost, those pods will be redeployed with their state intact and their volumes re-attached. The new consul-server-2 gets the volume from the previous consul-server-2, so the state under /consul/data is preserved.
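
For illustration, a sketch of the chart values that control that persistent state, under the assumption that server.storage and server.storageClass are what feed the StatefulSet's volumeClaimTemplates:

server:
  storage: 20Gi        # size of each server's persistent volume claim
  storageClass: gp2    # e.g. an EBS-backed storage class on AWS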

I would imagine that, should the pod and its volume get totally destroyed or corrupted, there would be a process to purge that node, so that the new node replacing it would receive snapshots from the other Consul servers and catch up to the cluster's current state.

darkn3rd avatar Sep 28 '22 10:09 darkn3rd

https://github.com/hashicorp/consul-k8s/issues/1612

darkn3rd avatar Oct 14 '22 22:10 darkn3rd

+1. I ran into the same issue with any consul-server pod deletion/restart in a 3-node deployment running Helm chart release 0.49.0 and Consul v1.13.2.

The first time I ran into this was while adding annotations to the consul-server StatefulSet pods via the Helm chart. The consul-server-2 pod would break leader election as it tried to come up. I was later able to replicate the failure by deleting any single pod, and I broke the cluster entirely by deleting all three.

I wound up deleting the namespace entirely and starting over with "leave_on_terminate": true in extraConfig, but that doesn't cover abnormal terminations such as node failures.

We're working on upgrading from the older Helm chart release 0.24.1 running Consul v1.8.2. I have not seen similar failures in the two years or so we have been running that version.

In my opinion, #1612 should not have been resolved.

doug-ba avatar Nov 14 '22 22:11 doug-ba

@doug-ba I agree that issue should not have been marked as resolved, but at least this one is open for visibility. When I first ran into this problem I also posted on the official HashiCorp forum, but it didn't get much traction either.

In the end, my team and I decided to drop Consul on top of K8s for several reasons, one of them being that the Helm chart doesn't seem production ready, which is not acceptable for our current use case.

dnlopes avatar Nov 17 '22 15:11 dnlopes

The problem is still here; it's easy to reproduce with the default values of chart v1.0.2.

spirkaa avatar Jan 10 '23 18:01 spirkaa

Let me try pinging someone directly to see if we can get some support on this issue.

@david-yu sorry for the direct tag, but it has been a long time since I opened this issue. Is this already on the HashiCorp team's radar?

dnlopes avatar Jan 14 '23 12:01 dnlopes

Hi @dnlopes, I have been following along and have been able to reproduce this; I do see the cluster go through a leader election cycle and stabilize after 5 minutes. I'm not sure why this is happening but will try to find out. I assume all is good, though, after the leader election completes and things stabilize?

david-yu avatar Jan 19 '23 20:01 david-yu

Just as a follow-up: any change that requires a Consul server config change or agent flag modification, such as https://developer.hashicorp.com/consul/docs/agent/config/cli-flags#_bootstrap_expect in this case, will require a rolling deploy of the servers. This is why you are seeing the servers bounce one by one in rolling-deploy fashion to roll out the new bootstrap_expect config to each server agent. This is expected behavior, and there will be a small window where the servers are bouncing one by one before things are back to normal.
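
A rough illustration of why the 3 to 5 replica change touches the server config (assuming no explicit server.bootstrapExpect override, in which case the chart derives bootstrap_expect from the replica count):

server:
  replicas: 5            # was 3; with no override, this is passed through as -bootstrap-expect=5
  # bootstrapExpect: 5   # optional explicit override via the chart's server.bootstrapExpect value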

We can note this in the upgrade docs if that helps folks understand why this is happening.

david-yu avatar Jan 19 '23 23:01 david-yu

The real problem, as I see it, is that if you don’t set "leave_on_terminate": true, the pods don’t rejoin the cluster. Setting "leave_on_terminate": true gets around this for normal terminations like rolling restarts, but probably won’t help with abnormal terminations or node failures.

doug-ba avatar Jan 20 '23 03:01 doug-ba

Hi @doug-ba, for abnormal terminations we typically suggest using a Consul snapshot to restore the cluster if it is not stable after some time.

However, if you still feel this requires more attention and it's not too much trouble, I would suggest opening a feature request with detailed server logs and reproduction steps so we can track the node failure scenario separately, as I believe that is outside the scope of the original issue.

david-yu avatar Jan 20 '23 03:01 david-yu

This one likely requires further rework of how upgrades via Helm are handled gracefully. From discussing with engineering, we generally don't recommend setting "leave_on_terminate" to true due to the chance that your cluster ends up in a split-brain scenario when you reduce the quorum size and rapidly add new servers. See also https://github.com/hashicorp/consul-helm/pull/764 for reference.

david-yu avatar Jan 25 '23 00:01 david-yu

Just as a follow-up: any change that requires a Consul server config change or agent flag modification, such as https://developer.hashicorp.com/consul/docs/agent/config/cli-flags#_bootstrap_expect in this case, will require a rolling deploy of the servers. This is why you are seeing the servers bounce one by one in rolling-deploy fashion to roll out the new bootstrap_expect config to each server agent. This is expected behavior, and there will be a small window where the servers are bouncing one by one before things are back to normal.

We can note this in the upgrade docs if that helps folks understand why this is happening.

To be totally honest with you (and without wanting to sound rude), I don't think a rolling update that takes 5 minutes to stabilize can be dealt with by a note in the documentation. In a K8s environment, the expectation is precisely that promoting new versions of services is easy, smooth, and without downtime. From what I remember, while the rolling update is happening Consul might be experiencing downtime, or at least service degradation, is this correct? That might mean APIs are unavailable or that SRE teams are being paged to address the incident. So, in my opinion, this should not be handled as a "documentation note", but probably as a bug.

The Helm chart uses a StatefulSet, and it uses the StatefulSet naming convention for Consul node names. This means node identity is preserved when pods are replaced (as in, for Consul membership, a new pod is not a new member; it's the same member as the previous pod, except for the initial setup, of course).

When pods are recycled, they come up with the same node name (i.e., the same identity) but under a different private IP (because K8s...). This may be confusing the Raft protocol, because Raft sees the same node (which is actually a different pod) trying to rejoin the cluster under a different IP, which is probably a condition the protocol is not expecting? (citation needed ❓)

I understand the split-brain risk you mentioned, and indeed all hell breaks loose if a pod is killed abruptly (i.e., no graceful leave happens), so the definitive solution certainly won't come from manipulating the leave_on_terminate setting.

I wonder whether the design decision to use StatefulSets fits this particular use case. But this is just a wild guess, because I'm not an expert on K8s; I'm leaving it here just as food for thought. 😅

dnlopes avatar Jan 25 '23 17:01 dnlopes

I just ran into this again today during a GKE node pool upgrade even though I have leave_on_terminate=true set. I’ve waited at least a half hour and the problem hasn’t resolved, just constant leader elections.

doug-ba avatar Jan 26 '23 20:01 doug-ba

I spoke too soon: after about half an hour the leader election succeeded. I’ll soon see if the same thing happens when the nodes hosting the other two server pods are drained, but this is pretty unsustainable.

doug-ba avatar Jan 26 '23 20:01 doug-ba

There were leader elections when one of the other two consul-server pods went down, but it only took about 5 minutes for the problem to resolve itself. That's still pretty unacceptable but not as bad as ~30 minutes.

doug-ba avatar Jan 26 '23 23:01 doug-ba

Hi @david-yu, we have encountered the same issue. Our values file is:

global:
  name: consul
  datacenter: eks
  tls:
    enabled: false
    enableAutoEncrypt: false
  acls:
    manageSystemACLs: true

connectInject:
  enabled: false

client:
  enabled: false
  grpc: false

server:
  replicas: 5
  storage: 20Gi
  storageClass: gp2
  bootstrapExpect: 5
  connect: false
  disruptionBudget:
    enabled: true
    maxUnavailable: 1
  resources:
    requests:
      memory: "1Gi"
      cpu: 1
    limits:
      memory: "1Gi"
      cpu: 1

syncCatalog:
  enabled: false

ui:
  ingress:
    enabled: true
    annotations: |
      kubernetes.io/ingress.class: nginx
    hosts:
      - host: "${consul_url}"
        paths:
          - "/"

I have tested the leave_on_terminate=true setting and everything works great: the pods restart quickly and the cluster keeps working throughout. But I don't like the possibility of split-brain; it's not suitable for production. There is also the question of what happens if all pods are shut down at the same time: will the cluster crash then?

Considering that in K8s IP addresses are dynamic and change after a pod is recreated, is there any way to avoid problems with Raft? Or is that only available in the enterprise version?

andreyreshetnikov-zh avatar Jul 14 '23 13:07 andreyreshetnikov-zh

For ease of investigation, I reduced the number of replicas to 3 and restarted the consul-server-0 pod at 19:47. I'm attaching logs from all pods: consul-server-0.txt consul-server-1.txt consul-server-2.txt. You can see that cluster recovery took about 10 minutes; sometimes it takes up to 20 minutes. The number of replicas doesn't matter; the same thing happens with 5 replicas. Is there any way around this without increasing the probability of split-brain?

andreyreshetnikov-zh avatar Jul 17 '23 15:07 andreyreshetnikov-zh

Fixed by #3000, which sets leave_on_terminate=true and makes some other changes to increase stability during rollouts. This will be released in consul-k8s 1.4.0, but you can get these changes now by setting:

server:
  extraConfig: |
    {"leave_on_terminate": true, "autopilot": {"min_quorum": <node majority>, "disable_upgrade_migration": true}}
  disruptionBudget:
    maxUnavailable: 1

(replace <node majority> with whatever number is a majority of your nodes; in the actual release this will be set automatically)
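
For example, with a 5-server cluster like the one earlier in this thread, the majority is 3, so a sketch of the values would be:

server:
  extraConfig: |
    {"leave_on_terminate": true, "autopilot": {"min_quorum": 3, "disable_upgrade_migration": true}}
  disruptionBudget:
    maxUnavailable: 1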

lkysow avatar Jan 16 '24 17:01 lkysow

Fixed by #3000, which sets leave_on_terminate=true and makes some other changes to increase stability during rollouts. This will be released in consul-k8s 1.4.0, but you can get these changes now by setting:

server:
  extraConfig: |
    {"leave_on_terminate": true, "autopilot": {"min_quorum": <node majority>, "disable_upgrade_migration": true}}
  disruptionBudget:
    maxUnavailable: 1

(replace <node majority> with whatever number is a majority of your nodes; in the actual release this will be set automatically)

Awesome! Thanks!

dnlopes avatar Jan 18 '24 18:01 dnlopes