Unstable deployment using Helm chart
Hello,
I have been trying out multiple deployment modes for Consul. I have successfully deployed a multi-DC setup via WAN federation using autoscaling groups in AWS, and now I'm moving on to deploying on top of K8s. The experience with ASGs was pretty smooth; in the end I had a pretty stable setup where I could add/remove nodes at will and the datacenters reacted smoothly (e.g., consensus was impeccable).
However, I keep running into stability issues with Raft consensus on top of K8s.
Question
- Increasing from 3 replicas to 5 replicas for some reason makes Raft lose consensus. I don't understand this; why would Raft lose consensus when adding replicas?
- With a 5-node deployment, changing a Consul setting (e.g., log rotation) and then running helm upgrade once again causes consensus to be lost.
One of the issues I believe I detected is that, because pods are named consul-server-1, consul-server-2, etc., when pods are recycled they come back up with the same name as the previous one (instead of generating a random suffix, for example). This causes some members of the cluster to detect two nodes with the same name (e.g., consul-server-2) running on different IPs: one from the new pod and one from the old pod that was just replaced. Because this happens very quickly, the other nodes don't have time to "forget" the old pod, causing a naming conflict.
Other more general questions:
- How can I automate the server upgrade process without downtime? The official documentation mentions that I should manipulate the server.updatePartition setting in multiple phases (see the sketch after this list). However, in a "real" scenario where deploys are managed by CI/CD tools, does that mean I need multiple commits and multiple deploys to make sure all servers receive the upgrade? That sounds a bit unproductive. Are there any alternatives to this while still using the official Helm chart?
- The Helm chart is using deprecated settings, both on the Consul side and on the K8s side (e.g., TLS settings and PodSecurityPolicy). Is this a known issue?
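For context, my understanding of the documented multi-phase flow is roughly the following; this is only an illustration for a 3-server cluster, and the exact values and number of steps depend on the replica count:

# Phase 1: apply the new Consul config with the partition pinned to the
# replica count, so no server pod is replaced yet (a StatefulSet rolling
# update only touches pods whose ordinal is >= the partition value).
server:
  replicas: 3
  updatePartition: 3
  extraConfig: |
    {
      "log_rotate_max_files": 7
    }
# Following phases: lower updatePartition one step at a time (3 -> 2 -> 1 -> 0)
# and run helm upgrade after each change, waiting for the freshly rolled
# server to report healthy before the next step.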
I'm probably missing something that could explain these issues, as I'm fairly new to working with K8s.
Helm Configuration
global:
  enabled: true
  logLevel: "debug"
  logJSON: false
  name: "dlo"
  datacenter: "dlo"
  consulAPITimeout: "5s"
  enablePodSecurityPolicies: true
  recursors: []
  tls:
    enabled: true
    enableAutoEncrypt: true
    serverAdditionalDNSSANs: []
    serverAdditionalIPSANs: []
    verify: true
    httpsOnly: true
    caCert:
      secretName: null
      secretKey: null
    caKey:
      secretName: null
      secretKey: null
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: null
      secretKey: null
    createReplicationToken: true
    replicationToken:
      secretName: null
      secretKey: null
  gossipEncryption:
    autoGenerate: true
  federation:
    enabled: false
    createFederationSecret: false
    primaryDatacenter: null
    primaryGateways: []
    k8sAuthMethodHost: null
  metrics:
    enabled: false
    enableAgentMetrics: false
    agentMetricsRetentionTime: "1m"
    enableGatewayMetrics: true
server:
  replicas: 5
  #affinity: null # for minikube, set null
  connect: true # setup root CA and certificates
  extraConfig: |
    {
      "log_level": "DEBUG",
      "log_file": "/consul/",
      "log_rotate_duration": "24h",
      "log_rotate_max_files": 7
    }
client:
  enabled: false
  affinity: null
  updateStrategy: |
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  extraConfig: |
    {
      "log_level": "DEBUG"
    }
ui:
  enabled: true
  service:
    enabled: true
    type: LoadBalancer
    port:
      http: 80
      https: 443
  metrics:
    enabled: false
  ingress:
    enabled: false
dns:
  enabled: false
externalServers:
  enabled: false
syncCatalog:
  enabled: false
connectInject:
  enabled: false
controller:
  enabled: false
meshGateway:
  enabled: false
ingressGateways:
  enabled: false
terminatingGateways:
  enabled: false
apiGateway:
  enabled: false
webhookCertManager:
  tolerations: null
prometheus:
  enabled: false
(I know most of the values above are the defaults, but I wanted a YAML file with the full config so I could tweak it incrementally.)
Steps to reproduce this issue
- helm install with 3 replicas and wait for healthy nodes
- change the config to 5 replicas and upgrade the Helm installation
- consensus is lost and the nodes take a long time (> 5 minutes) to reach consensus again
Current understanding and Expected behavior
- When adding nodes, consensus should not be lost
- When changing node configurations, pod replacement should be done carefully in order to keep consensus and avoid re-elections.
Environment details
I have tested this setup both in minikube and in AWS EKS, with the same outcome in both.
To address the issue described above (pods coming back with the same node name but a new IP, causing naming conflicts), I will try adding the leave_on_terminate: true setting to server.extraConfig to see if this helps.
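Concretely, I plan to merge it into the existing extraConfig block, roughly like this:

server:
  extraConfig: |
    {
      "log_level": "DEBUG",
      "log_file": "/consul/",
      "log_rotate_duration": "24h",
      "log_rotate_max_files": 7,
      "leave_on_terminate": true
    }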
I will report back once I have tested this.
So, the leave_on_terminate: true setting does seem to help when recycling pods. Server nodes now shut down gracefully and inform the whole cluster about it, so when a pod is recreated with the same name no conflicts arise. However, this won't be the case if a pod is killed abruptly, for instance due to a hardware failure, so I'm not sure how stable this setup is under random failures like that.
Since consul-server is a StatefulSet that uses external volumes (AWS EBS, for example), if a VM fails and a K8s node together with the pods running on it is lost, those pods will be redeployed with their state intact and the volumes re-attached: the new consul-server-2 gets the volume from the previous consul-server-2, so the state under /consul/data is preserved.
I would imagine that if the pod and its volume were totally destroyed or corrupted, there would be a process to purge that node, so that a replacement node could catch up from the other Consul servers to the cluster's current state.
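If I understand correctly, Consul's autopilot already has a cleanup_dead_servers option (enabled by default, I believe) that is supposed to handle this kind of purge automatically; presumably it could be tuned through the same extraConfig block, e.g. (untested sketch):

server:
  extraConfig: |
    {
      "autopilot": {
        "cleanup_dead_servers": true
      }
    }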
https://github.com/hashicorp/consul-k8s/issues/1612
+1, I ran into the same issue with any consul-server pod deletions/restarts in a 3-node deployment running Helm release 0.49.0 and Consul v1.13.2.
The first time I ran into this was when adding annotations to the consul-server StatefulSet pods via the Helm chart. The consul-server-2 pod would break leader elections as it tried to come up. I was later able to replicate the failure by deleting any pod, and I broke the cluster entirely by deleting all three.
I wound up deleting the namespace entirely and starting over with "leave_on_terminate": true in extraConfig, but that doesn't solve for abnormal terminations such as node failures.
We're working on upgrading from the older Helm chart release 0.24.1 running Consul v1.8.2. I have not seen similar failures in the two years or so we have been running that version.
In my opinion, #1612 should not have been resolved.
@doug-ba I agree that issue should not have been marked as resolved, but at least this one is open for visibility. When I found this issue I also posted on the official HashiCorp forum, but it didn't get much traction either.
In the end, my team and I decided to drop Consul on top of K8s, one of the reasons being that the Helm chart doesn't seem production ready, which is not acceptable for our current use case.
The problem is still here, and it is easy to reproduce with the default values of chart v1.0.2.
Let me try to ping someone directly to see if we can get some support on this issue.
@david-yu sorry for the direct tag, but it has been a long time since I opened this issue. Is this something already on the HashiCorp team's radar?
Hi @dnlopes, I have been following along and have been able to reproduce this; I do see the cluster go through a leader-election cycle and stabilize after 5 minutes. I'm not sure why this is happening but will try to find out. I assume all is good, though, after the leader election once things start to stabilize?
Just as a follow-up: any change that requires a Consul server config change or agent flag modification, such as https://developer.hashicorp.com/consul/docs/agent/config/cli-flags#_bootstrap_expect in this case, will require a rolling deploy of the servers. This is why you are seeing the servers bounce one by one in a rolling-deploy fashion, to roll out the new bootstrap_expect config to each of the server agents. This is expected behavior, and there should be a small window where the servers are bouncing one by one before things are back to normal.
We can note this in the upgrade docs if that helps folks understand why this is happening.
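To make the connection explicit: if I read the chart defaults correctly, when server.bootstrapExpect is left unset it is derived from server.replicas, so scaling 3 -> 5 also changes each server's bootstrap_expect and forces the roll. Pinning it in the values file would look roughly like this (illustrative snippet):

server:
  replicas: 5
  # With bootstrapExpect unset, the chart falls back to replicas, so a
  # replica change also rewrites the servers' bootstrap_expect config.
  bootstrapExpect: 5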
The real problem, as I see it, is that if you don't set "leave_on_terminate": true, the pods don't rejoin the cluster. Setting "leave_on_terminate": true gets around this for normal terminations like rolling restarts, but probably won't work for abnormal terminations or node failures.
Hi @doug-ba, for abnormal terminations we typically suggest using a Consul snapshot to restore the cluster if it is not stable after some time.
However, if you still feel this requires more attention and it's not too much trouble, I would suggest opening a feature request with detailed server logs and reproduction steps so we can track the node-failure scenario separately, as I believe that is outside the scope of the original issue.
This one likely requires further rework of how upgrades via Helm are handled gracefully. From discussing with engineering, we generally don't recommend setting "leave_on_terminate" to true due to the chance that your cluster can end up in a split-brain scenario when you reduce quorum size and rapidly add new servers. See also https://github.com/hashicorp/consul-helm/pull/764 for reference.
Regarding the point above that the rolling server deploy is expected behavior and could simply be noted in the upgrade docs:
To be totally honest with you (and without wanting to sound rude), I don't think a rolling update that takes 5 minutes to stabilize can be dealt with by a note in the documentation. In a K8s environment, the expectation is precisely that promoting new versions of services is easy, smooth, and without downtime. From what I remember, while the rolling update is happening Consul might be experiencing downtime, or at least service degradation; is this correct? This might mean APIs are unavailable, or that SRE teams are being paged to address the incident. So, in my opinion, this should not be dealt with as a "documentation note", but probably as a bug.
The Helm chart uses a StatefulSet, and it uses the StatefulSet naming convention for Consul node names. This means node identity is preserved when pods are replaced (i.e., for Consul membership, a new pod is not a new member; it is the same member as the previous pod, except during the initial setup, of course).
When pods are recycled, they come up with the same node name (i.e., the same identity) but under a different private IP (because K8s...). This may be confusing the Raft protocol, because Raft sees the same node (which is actually a different pod) trying to rejoin the cluster under a different IP, which is probably a condition the protocol is not expecting? (citation needed ❓)
I understand the split-brain risk you mentioned, and indeed all hell breaks loose if a pod is killed abruptly (i.e., no graceful leave happens), so the definitive solution certainly won't come from manipulating the leave_on_terminate setting.
I wonder whether the design decision to use StatefulSets fits this particular use case. But this is just a wild guess, because I'm not an expert on K8s; I'm leaving it here as food for thought. 😅
I just ran into this again today during a GKE node pool upgrade, even though I have leave_on_terminate=true set. I've waited at least half an hour and the problem hasn't resolved, just constant leader elections.
I spoke too soon; after about half an hour the leader election succeeded. I'll see soon whether the same thing happens when the nodes hosting the other two server pods are drained, but this is pretty unsustainable.
There were leader elections when one of the other two consul-server pods went down, but it only took about 5 minutes for the problem to resolve itself. That's still pretty unacceptable, but not as bad as ~30 minutes.
Hi @david-yu, we have encountered the same error; our values file is:
global:
  name: consul
  datacenter: eks
  tls:
    enabled: false
    enableAutoEncrypt: false
  acls:
    manageSystemACLs: true
connectInject:
  enabled: false
client:
  enabled: false
  grpc: false
server:
  replicas: 5
  storage: 20Gi
  storageClass: gp2
  bootstrapExpect: 5
  connect: false
  disruptionBudget:
    enabled: true
    maxUnavailable: 1
  resources:
    requests:
      memory: "1Gi"
      cpu: 1
    limits:
      memory: "1Gi"
      cpu: 1
syncCatalog:
  enabled: false
ui:
  ingress:
    enabled: true
    annotations: |
      kubernetes.io/ingress.class: nginx
    hosts:
      - host: "${consul_url}"
        paths:
          - "/"
I have tested the leave_on_terminate=true setting and everything works great: the pods restart quickly and the cluster does not interrupt its work during the restart. But I don't like the possibility of split brain; it's not suitable for production. There is also a question: what happens if all pods are turned off at the same time, will the cluster crash?
Considering that in K8s IP addresses are dynamic and change after a pod is recreated, is there any way to avoid problems with Raft? Or is that only available in the enterprise version?
For ease of research, I reduced the number of replicas to 3 and restarted the consul-server-0 pod at 19:47. I attach the logs of all pods:
consul-server-0.txt
consul-server-1.txt
consul-server-2.txt
You can see that the cluster recovery took about 10 minutes; sometimes it takes up to 20 minutes. The number of replicas does not matter; the same thing happens with 5 replicas.
Is there any way around this without increasing the probability of split brain?
Fixed by #3000, which sets leave_on_terminate=true and makes some other changes to increase stability during rollouts. This will be released in consul-k8s 1.4.0, but you can get these changes now by setting:
server:
  extraConfig: |
    {"leave_on_terminate": true, "autopilot": {"min_quorum": <node majority>, "disable_upgrade_migration": true}}
  disruptionBudget:
    maxUnavailable: 1
(replace <node majority> with whatever number is a majority of your nodes; in the actual release this will be set automatically)
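To make the placeholder concrete: with server.replicas: 5, a node majority is 3, so (purely as an illustration) the snippet would read:

server:
  extraConfig: |
    {"leave_on_terminate": true, "autopilot": {"min_quorum": 3, "disable_upgrade_migration": true}}
  disruptionBudget:
    maxUnavailable: 1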
Awesome! Thanks!