consul-k8s
helm: Consul Snapshot Agent is using client's tolerations
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place. Thank you!
- Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request.
- If you are interested in working on this issue or have submitted a pull request, please leave a comment.
Overview of the Issue
We deploy the snapshot agent via Helm, and found that the snapshot agent is running on our k8s control-plane nodes, which is undesired -- the snapshot agent should run on normal worker nodes.
The cause of the issue is that the snapshot agent gets its tolerations from the Consul client (values.yaml, template yaml). This doesn't make sense: Consul clients run on every node as a DaemonSet, so they are usually given strong tolerations. In our case, since the snapshot agent also inherited those strong tolerations, it landed on the control-plane nodes, because the control plane has lower hardware usage than the normal worker nodes.
Expected behavior
If the Consul server is enabled in Helm, the snapshot agent should be colocated with the Consul servers, so that it can take the snapshot locally.
If the Consul server is not enabled (which is our case -- our Consul servers run on bare metal), the snapshot agent should run on any normal worker node.
Also, the snapshot agent stanza should be moved out of the consul.client path, simply because the snapshot agent doesn't belong to the client.
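For illustration, the server colocation described above could be expressed with pod affinity along these lines. This is a minimal sketch, assuming the server pods carry `app: consul` / `component: server` labels (the exact label names vary by chart version):

```yaml
# Hypothetical snapshot-agent pod spec fragment: only schedule onto
# nodes that already run a Consul server pod.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: consul        # assumed server pod label
            component: server  # assumed server pod label
        topologyKey: kubernetes.io/hostname
```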
Hi @weichuliu, we might need more info here. Did you mean Consul clients when you mentioned Consul servers running on every node in a DaemonSet? Also, do you mean the snapshot agent landed on the K8s control-plane nodes (i.e. master nodes)? Typically you would not want any workloads to be scheduled on Kubernetes master nodes.
I'm curious what your Helm values file looks like. Thanks!
@david-yu
As you pointed out, I meant Consul clients running as a DaemonSet; it was a typo.
Yes, snapshot agent landed on master nodes.
Part of our Helm values:
```yaml
consul:
  client:
    enabled: true
    tolerations: |
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
    snapshotAgent:
      enabled: true
      replicas: 2
      resources:
        requests:
          ...
        limits:
          ...
```
The reason we set the client's tolerations to NoExecute/NoSchedule is to make it run on all nodes.
However, this line makes the snapshot agent use the same tolerations. Since the master nodes have the lowest hardware utilization, the snapshot agent lands on the master nodes.
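For context, this scheduling follows from how blanket tolerations interact with the control-plane taint. A sketch of the interaction (the taint key shown is the common kubeadm default and may differ per cluster):

```yaml
# Control-plane nodes typically carry a taint like this:
taints:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule

# A toleration with `operator: Exists` and no key matches ANY taint
# with that effect, so it also tolerates the control-plane taint:
tolerations:
  - effect: NoSchedule
    operator: Exists
```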
Hi, the reason for this is that we want the snapshot agents to run on the same nodes as the Consul clients. We don't want them to land on nodes that aren't running Consul clients, because they require a Consul client to be available.
That being said, our other components, like the controller, also require a node with a Consul client, and they have separate tolerations. So I think it would make sense to split this out into its own top-level key with its own tolerations:
```yaml
client:
  ...
snapshotAgent:
  ...
```
At the same time, we'd need to ensure this change is backwards compatible.
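One way to keep it backwards compatible would be a template fallback that prefers the new top-level tolerations and falls back to the client's when they are unset. A rough sketch, assuming string-typed toleration values as in the values file above (the `snapshotAgent.tolerations` path is part of the proposal, not the shipped chart):

```yaml
# Hypothetical snapshot-agent deployment template fragment.
{{- if .Values.snapshotAgent.tolerations }}
tolerations:
  {{- tpl .Values.snapshotAgent.tolerations . | nindent 2 }}
{{- else if .Values.client.tolerations }}
tolerations:
  {{- tpl .Values.client.tolerations . | nindent 2 }}
{{- end }}
```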
@lkysow Yeah I think your comment makes perfect sense.
Looking forward to the fix.
Closing as clients now run as sidecars to Consul servers.