consul-k8s

helm: Consul Snapshot Agent is using client's tolerations

Open weichuliu opened this issue 4 years ago • 5 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

We deploy the snapshot agent via Helm and found that it is running on our k8s control-plane nodes, which is undesired: the snapshot agent should run on normal worker nodes.

The cause is that the snapshot agent gets its tolerations from the Consul client (values.yaml, template yaml). This doesn't make sense. Consul clients run on every node as a DaemonSet and are usually given broad tolerations. In our case, since the snapshot agent also received these broad tolerations, it landed on control-plane nodes, because control-plane nodes have lower hardware usage than normal worker nodes.

Expected behavior

If the Consul server is enabled in Helm, the snapshot agent should be co-located with the Consul servers, so that it can take the snapshot locally.

If the Consul server is not enabled (that is our case: our Consul servers run on bare metal), the snapshot agent should run on any normal worker node.

Also, the snapshot agent stanza should be moved out of the consul.client path, simply because the snapshot agent doesn't belong to the client.

weichuliu avatar Jun 25 '21 02:06 weichuliu

Hi @weichuliu, we might need more info here. Did you mean Consul clients when you mentioned Consul servers running on every node in a DaemonSet? Also, do you mean the snapshot agent landed on the K8s control-plane nodes (i.e. master nodes)? Typically you would not want any workloads to be scheduled on Kubernetes master nodes.

I'm curious what your Helm values file looks like. Thanks!

david-yu avatar Jun 25 '21 18:06 david-yu

@david-yu

As you pointed out, I meant Consul clients running as a DaemonSet. It was a typo.

Yes, the snapshot agent landed on master nodes.

Part of our Helm values:

consul:
  client:
    enabled: true

    tolerations: |
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
    snapshotAgent:
      enabled: true
      replicas: 2
      resources:
        requests:
          ...
        limits:
          ...

weichuliu avatar Jun 28 '21 02:06 weichuliu

The reason is that we set the client's tolerations to NoExecute/NoSchedule so that it runs on all nodes.

However, this line makes the snapshot agent use the same tolerations. Since the master nodes have the lowest hardware utilization, the snapshot agent lands on them.
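
To illustrate why the shared tolerations have this effect, here is a rough sketch. The taint shown is an assumption (a kubeadm-style control-plane taint); the actual taints on the affected nodes are not given in this thread.

```yaml
# Assumed taint on a control-plane node (lives under the Node's spec.taints).
# Illustrative only; the real cluster may use a different key.
taints:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule

# Toleration inherited from the client values above (Pod's spec.tolerations).
# With operator Exists and no key, it matches every taint that has the
# NoSchedule effect, including the assumed control-plane taint.
tolerations:
  - effect: NoSchedule
    operator: Exists
```

Once the taint is tolerated, nothing keeps the scheduler from treating the control-plane node as a candidate, and a lightly loaded node is likely to be picked.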

weichuliu avatar Jun 28 '21 02:06 weichuliu

Hi, the reason for this is that we want the snapshot agents to run on the same nodes as Consul clients. We don't want them to land on nodes that aren't running Consul clients because they require a Consul client to be available.

That being said, our other components like the controller also require a Consul client node and they have separate tolerations. So I think it would make sense to split this out into its own top-level key with its own tolerations:

client:
  ...
snapshotAgent:
  ...

At the same time, we'd need to ensure this change is backwards compatible.
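
Filled in a little, the split might look roughly like the sketch below. This is only an assumption about the eventual layout: the top-level `snapshotAgent` key and its `tolerations` field are hypothetical and not an existing chart option at the time of this comment. How backwards compatibility with `client.snapshotAgent` would be handled is not shown.

```yaml
client:
  enabled: true
  # Broad tolerations so the DaemonSet runs on every node.
  tolerations: |
    - effect: NoExecute
      operator: Exists
    - effect: NoSchedule
      operator: Exists

# Hypothetical top-level stanza, no longer nested under client.
snapshotAgent:
  enabled: true
  replicas: 2
  # Hypothetical field: left empty so the agent does not tolerate
  # control-plane taints and stays on ordinary worker nodes.
  tolerations: ""
```

With the client still running on every node, a snapshot agent scheduled on any worker node would still have a local Consul client available.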

lkysow avatar Jun 30 '21 15:06 lkysow

@lkysow Yeah I think your comment makes perfect sense.

Looking forward to the fix

weichuliu avatar Jul 01 '21 05:07 weichuliu

Closing as clients now run as sidecars to Consul servers.

david-yu avatar Nov 17 '22 03:11 david-yu