
clusterAgent maxSurge and maxUnavailable with `kind:DatadogAgent`

Open marcinkubica opened this issue 10 months ago • 6 comments

Describe what happened: attempted to control the Cluster Agent's maxSurge and maxUnavailable with a `kind: DatadogAgent` manifest

Describe what you expected: the custom values to be applied

It seems these values can't be customised.

marcinkubica avatar Apr 23 '24 14:04 marcinkubica

Thanks for submitting the issue @marcinkubica. Could you please provide the following:

  • Details about your environment
  • Sample manifest
  • What you are getting
  • What you expect to get

levan-m avatar Apr 23 '24 14:04 levan-m

hi @levan-m!

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  features:
    admissionController:
      enabled: true
    apm:
      enabled: true
    clusterChecks:
      enabled: true
    kubeStateMetricsCore:
      enabled: true
    logCollection:
      containerCollectAll: false
      enabled: true
    npm:
      enabled: true
  global:
    clusterName: example-01
    credentials:
      apiSecret:
        keyName: api-key
        secretName: operator-keys
      appSecret:
        keyName: app-key
        secretName: operator-keys
    site: datadoghq.eu
  override:
    clusterAgent:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - cluster-agent
              topologyKey: kubernetes.io/hostname
      containers:
        cluster-agent:
          securityContext:
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            runAsUser: 1000
      image:
        tag: 7.52.1
      replicas: 2
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
    nodeAgent:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-nodepool
                    operator: In
                    values:
                      - example
      image:
        tag: 7.52.1

I have a couple of GKE clusters with 1 or 2 VMs. Having 1 or 2 replicas for clusterAgent there causes Pods to go unscheduled during rollouts, due to what I believe is the rollout strategy set in the Deployment generated by the CRD:

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%

I'm struggling to control maxUnavailable/maxSurge or to change the type to Recreate. Having checked the CRD definition, it doesn't seem an override for the deployment strategy is possible?
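For context on why this bites with 2 replicas: Kubernetes rounds a percentage maxUnavailable down and a percentage maxSurge up, so 25%/25% resolves to maxUnavailable=0, maxSurge=1. The rollout therefore always tries to place a third pod, which the required anti-affinity can never schedule on 2 nodes. A strategy like this (what I'd set if the CRD allowed it) would roll in place instead:

```yaml
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Delete an old DCA pod first so its node frees up
      # for the replacement under the anti-affinity rule.
      maxUnavailable: 1
      maxSurge: 0
```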

marcinkubica avatar Apr 23 '24 15:04 marcinkubica

Thanks for the information. Indeed, the current CRD doesn't offer a way to customize the deployment strategy.

To make sure I understand the problem correctly, could you confirm the following:

  1. You have a cluster with 2 VMs.
  2. You deploy the Agent using the above manifest, which defines an anti-affinity rule to prevent running two DCA pods on the same host.
  3. During a rollout(?), new DCA pods don't get scheduled.
  4. The hypothesis is that this is due to the rolling update strategy, and the solution would be reducing maxUnavailable to 0 or changing the strategy to Recreate, which would delete all DCA pods before creating new ones.

I'll try to reproduce this scenario locally on Kind.

In the meantime, could you describe one of those unscheduled pods and share the reason?

levan-m avatar Apr 24 '24 00:04 levan-m

hi @levan-m! In essence, the issue comes in when upgrading the agent version via the CRD under this specific condition.

If the number of clusterAgent replicas equals the number of nodes and podAntiAffinity is set, we have the usual scenario where a new pod can't be scheduled during rollout because no hosts are available, unless we allow scheduling multiple Pods on the same host.

This can be addressed either by changing the strategy type to Recreate or by tuning the maxUnavailable and maxSurge pair.
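For completeness, the Recreate variant is the simplest option, at the cost of a brief window with no DCA running, since all old pods are deleted before new ones are created:

```yaml
  strategy:
    type: Recreate
```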

I believe allowing spec.strategy to be overridden is the solution here.

Thank you!

marcinkubica avatar Apr 24 '24 09:04 marcinkubica

Thanks for the context. I was able to reproduce the issue on a Kind cluster. Scaling the Operator pods to 0 and manipulating the Cluster Agent's spec.strategy resolves the issue.

I'll create a feature request in our backlog. Will post an update once we determine the priority and ETA.

levan-m avatar Apr 30 '24 01:04 levan-m

May I add something here, please. Since v2alpha1 this is also no longer possible for the agent DaemonSet. With v1alpha1 there was the possibility to override maxUnavailable; since v2 there is not. In huge clusters this can force us to wait hours for rollouts.

rockaut avatar Jun 11 '24 10:06 rockaut

Hello @marcinkubica, thanks for bringing this to our attention. We added strategy overrides for the Cluster Agent and Cluster Checks Runner in v1.8. @rockaut The DaemonSet rollout strategy can now be overridden too. Let us know if you encounter any issues!
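For reference, a sketch of the new override (field names per the v2alpha1 component override; double-check against the CRD reference for your operator version):

```yaml
  override:
    clusterAgent:
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          # Roll in place: free a node before scheduling the new pod.
          maxSurge: 0
          maxUnavailable: 1
    nodeAgent:
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          # DaemonSet rolling updates are driven by maxUnavailable;
          # raising it speeds up rollouts on large clusters.
          maxUnavailable: 10%
```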

Eokye avatar Aug 23 '24 16:08 Eokye