datadog-operator
clusterAgent maxSurge and maxUnavailable with `kind:DatadogAgent`
Describe what happened:
Attempted to control the Cluster Agent's maxSurge and maxUnavailable with a `kind: DatadogAgent` manifest.

Describe what you expected:
Custom values applied. It seems they cannot be customised.
Thanks for submitting the issue @marcinkubica. Could you please provide the following:
- Details about your environment
- Sample manifest
- What you are getting
- What you expect to get
hi @levan-m!
```yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  features:
    admissionController:
      enabled: true
    apm:
      enabled: true
    clusterChecks:
      enabled: true
    kubeStateMetricsCore:
      enabled: true
    logCollection:
      containerCollectAll: false
      enabled: true
    npm:
      enabled: true
  global:
    clusterName: example-01
    credentials:
      apiSecret:
        keyName: api-key
        secretName: operator-keys
      appSecret:
        keyName: app-key
        secretName: operator-keys
    site: datadoghq.eu
  override:
    clusterAgent:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - cluster-agent
              topologyKey: kubernetes.io/hostname
      containers:
        cluster-agent:
          securityContext:
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            runAsUser: 1000
      image:
        tag: 7.52.1
      replicas: 2
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
    nodeAgent:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-nodepool
                    operator: In
                    values:
                      - example
      image:
        tag: 7.52.1
```
I have a couple of GKE clusters with 1 or 2 VMs. Running 1 or 2 clusterAgent replicas there causes Pods to go unscheduled, due to what I believe is the rollout strategy set in the Deployment generated by the CRD:
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 25%
    maxSurge: 25%
```
I'm struggling to control maxUnavailable/maxSurge or to change the strategy to `type: Recreate`.
Having checked the CRD definition, there doesn't seem to be a possible override for the deployment strategy?
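For context on why the defaults deadlock here: Kubernetes rounds a percentage maxSurge up and maxUnavailable down. A quick sketch of that arithmetic (the helper name is my own, not from the operator):

```python
import math

def effective_rollout_limits(replicas: int, max_surge_pct: float, max_unavailable_pct: float):
    """Kubernetes rounds maxSurge up and maxUnavailable down for percentage values."""
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return surge, unavailable

# With 2 replicas and the default 25%/25%: one surge pod is allowed,
# but zero pods may be unavailable. Anti-affinity on 2 nodes then leaves
# the surge pod unschedulable, and no old pod may be removed to make room.
print(effective_rollout_limits(2, 25, 25))  # (1, 0)
```

So the rollout creates a third pod that can never schedule, while maxUnavailable effectively 0 forbids deleting either old pod first.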
Thanks for the information. Indeed, the current CRD doesn't offer a way to customize the deployment strategy.
To make sure I understand the problem correctly, could you confirm the following:
- You have a cluster with 2 VMs.
- You deploy the Agent using the above manifest, which defines an anti-affinity rule to prevent running two DCA pods on the same host.
- During a rollout(?), new DCA pods don't get scheduled.
- The hypothesis is that this could be due to the rolling update strategy, and the solution would be reducing maxUnavailable to 0 or changing the strategy to `Recreate`, which would delete all DCA pods before creating new ones.

I'll try to reproduce this scenario locally on Kind.
In the meantime, could you describe one of those unscheduled pods and share the reason?
Hi @levan-m, in essence the issue arises when upgrading the agent version with the CRD under this specific condition:
If the number of clusterAgent replicas equals the number of nodes and podAntiAffinity is set, we have the usual scenario where a new pod can't be scheduled during the rollout because no hosts are available, unless we allow scheduling multiple Pods on the same host.
This can be addressed either by changing to `Recreate` or with a maxUnavailable/maxSurge pair.
I believe allowing an override of `spec.strategy` is the solution here.
Thank you!
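For illustration, either workaround would look like this on the generated Deployment (a sketch with illustrative values, not something the CRD currently accepts):

```yaml
# Option 1: delete all DCA pods before creating new ones
strategy:
  type: Recreate
---
# Option 2: remove an old pod first, then schedule its replacement in the freed slot
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 0
```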
Thanks for the context. I was able to reproduce the issue on a Kind cluster. Scaling the Operator pods to 0 and manipulating the Cluster Agent's `spec.strategy` resolves the issue.
I'll create a feature request in our backlog and will post an update once we determine the priority and ETA.
May I add something here, please? Since v2alpha1 this is also no longer possible for the Agent DaemonSet. With v1alpha1 there was a way to override maxUnavailable; since v2 there is not. In huge clusters this can mean waiting hours for rollouts.
Hello @marcinkubica, thanks for bringing this to our attention. We added strategy overrides for the Cluster Agent and Cluster Checks Runner in v1.8. @rockaut the DaemonSet rollout strategy should be overridable now too. Let us know if you encounter any issues!
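For readers landing here later, the v1.8+ override might be used roughly like this (a sketch assuming the override field is named `updateStrategy`; verify the field name and shape against the CRD shipped with your operator version):

```yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  override:
    clusterAgent:
      # Assumed field name; check `kubectl explain datadogagent.spec.override.clusterAgent`
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 0
```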