Add untaint controller to Datadog Operator for startup taint removal

Open imdevin567 opened this issue 5 months ago • 10 comments

Summary

Add an optional untainting controller to the Datadog Operator that removes a specific taint from a node once the Datadog agent is successfully running on it. This would allow users to enforce that the Datadog agent is the first workload scheduled on new nodes, ensuring observability coverage before any other workloads begin execution.

Use Case

In many environments, it's important to ensure observability agents like the Datadog agent are running before any application workloads are scheduled on a node. One pattern to achieve this is to apply a "startup" taint (e.g., node.datadoghq.com/startup=datadog:NoSchedule) to the node pool at provisioning time. The taint blocks all pods except those that tolerate it (e.g., the Datadog agent).
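To make the blocking behavior concrete, here is a minimal sketch of how a toleration is matched against a taint. It follows the standard Kubernetes matching rules as I understand them (effect, then key, then operator). The `Taint` and `Toleration` classes and the `schedulable` helper are simplified stand-ins written for illustration, not real client types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches every taint
    operator: str = "Equal"  # Kubernetes treats a missing operator as "Equal"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    return False


def schedulable(pod_tolerations: list[Toleration], node_taints: list[Taint]) -> bool:
    """A pod may land on the node only if every hard taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )
```

With the startup taint above, a pod carrying a catch-all toleration (`operator: Exists`, as agent DaemonSets commonly do) is schedulable, while a pod with no tolerations is not.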

Currently, there's no automated way to remove this taint once the agent is confirmed to be running. This requires out-of-band scripting or external controllers, which adds operational overhead and complexity.

By having the Datadog Operator manage this behavior, the system can:

  • Ensure the agent is the first workload on a node
  • Automatically remove the startup taint once the agent is successfully running
  • Reduce complexity and eliminate the need for custom automation

Proposal

Introduce a new optional controller in the Datadog Operator that:

  1. Watches nodes that carry a configurable taint (e.g., node.datadoghq.com/startup=datadog:NoSchedule)
  2. Detects when a healthy Datadog agent pod is running on that node
  3. Removes the configured taint from the node
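The three steps above could be sketched as a single pure function that decides which taints a node should keep after one reconcile pass. This is only an illustration of the idea, not the operator's design: the `Pod` shape, the `AGENT_LABELS` selector, and the definition of "healthy" (phase Running plus Ready) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass
class Pod:
    node_name: str
    labels: dict
    phase: str   # e.g. "Pending", "Running"
    ready: bool  # True when the pod's Ready condition is True


# Hypothetical label selector for agent pods; the real labels may differ.
AGENT_LABELS = {"example.com/role": "datadog-agent"}


def agent_healthy_on(node_name: str, pods: list[Pod]) -> bool:
    """Step 2: is there a running, ready agent pod on this node?"""
    for pod in pods:
        if pod.node_name != node_name:
            continue
        is_agent = all(pod.labels.get(k) == v for k, v in AGENT_LABELS.items())
        if is_agent and pod.phase == "Running" and pod.ready:
            return True
    return False


def reconcile_taints(
    node_name: str, taints: list[Taint], pods: list[Pod], startup: Taint
) -> list[Taint]:
    """Return the taints the node should have after one reconcile pass."""
    if startup not in taints:                  # step 1: only act on tainted nodes
        return list(taints)
    if not agent_healthy_on(node_name, pods):  # step 2: wait for the agent
        return list(taints)
    return [t for t in taints if t != startup]  # step 3: drop only the startup taint
```

Keeping the decision separate from the API calls makes it easy to test, and it leaves unrelated taints on the node untouched.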

Configuration could be introduced via the DatadogAgent CRD, for example:

spec:
  untaint:
    enabled: true
    taintKey: "node.datadoghq.com/startup"
    taintValue: "datadog"
    taintEffect: "NoSchedule"

Similar Patterns in the Wild

Istio provides a similar feature in their Operator to ensure their control plane components are prioritized before other workloads.

Benefits

  • Ensures observability is initialized before application workloads
  • Simplifies node bootstrapping workflows in cluster autoscaling environments
  • Reduces reliance on custom scripts or external controllers
  • Aligns with patterns used in other Operators (e.g., Istio)

Thank you for considering this feature! I'd be happy to contribute or test this functionality if it's accepted.

imdevin567 avatar Jul 18 '25 15:07 imdevin567

+1, we are running on EKS and experience an issue occasionally where user pods are able to start before the node agent is ready.

drcrees avatar Jul 22 '25 14:07 drcrees

Hi @imdevin567, thanks for opening the issue and sharing your proposal. Solving this problem is something we have discussed internally and will likely prioritize in 2026. However, I can't tell yet what the design will be, or whether it will be similar to the one you are proposing.

levan-m avatar Aug 18 '25 21:08 levan-m

@levan-m, thanks for the update. That is quite a bit of lead time, though; this should be considered a defect.

drcrees avatar Aug 19 '25 13:08 drcrees

Also having this issue in EKS when using Karpenter. Does anyone have a hack/solution in the meantime?

codeadict avatar Aug 21 '25 20:08 codeadict

Two avenues you can go down, @codeadict, that we are looking at:

  • Use Karpenter startup taints, then run an additional sidecar container on your DD agent that checks the status of the agent and removes the taint.
  • Change AgentCommunicationMode to socket. This falls into the "hack" category: pods then have volume mounts injected for the sockets, and the volume mounting will block the pod from running until the agent has set up the sockets on the host.
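For the sidecar approach, the taint removal itself can be expressed as a JSON Patch (RFC 6902) against the node. Below is a rough sketch of building that patch; `build_untaint_patch` is a hypothetical helper, and it assumes the taints are passed in exactly as read from the node's `spec.taints`. The leading `test` operation is there so the patch fails, rather than overwriting, if another controller changed the taints in between.

```python
def build_untaint_patch(taints: list[dict], key: str) -> list[dict]:
    """Build JSON Patch operations that remove every taint with the given key.

    Returns an empty list when the node does not carry the taint,
    so the caller can skip the API call entirely.
    """
    remaining = [t for t in taints if t.get("key") != key]
    if len(remaining) == len(taints):
        return []
    return [
        # Guard: only apply if the taints are still what we read earlier.
        {"op": "test", "path": "/spec/taints", "value": taints},
        {"op": "replace", "path": "/spec/taints", "value": remaining},
    ]
```

The sidecar would send the result as a JSON Patch on the node object once its own agent health check passes, and would need RBAC permission to patch nodes.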

jess-belliveau avatar Aug 21 '25 22:08 jess-belliveau

  • Change AgentCommunicationMode to socket. This falls into the "hack" category: pods then have volume mounts injected for the sockets, and the volume mounting will block the pod from running until the agent has set up the sockets on the host.

We're using UDS and it still happens. The volume is created, but it appears the socket file isn't there until the agent starts. Our trace clients have reported not being able to find /var/run/datadog/apm.socket.
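One stopgap on the client side, sketched here purely as an illustration, is to have the application wait for the socket file to appear before initializing the tracer. `wait_for_socket` is a hypothetical helper and the timeout values are arbitrary; it does not fix the ordering problem, it only makes the pod tolerate it.

```python
import os
import stat
import time


def wait_for_socket(path: str, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a Unix socket exists at `path`; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The directory mount can exist before the agent creates the socket,
            # so check for the socket file itself, not just the directory.
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True
        except FileNotFoundError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

An entrypoint could call `wait_for_socket("/var/run/datadog/apm.socket")` and only then start the application.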

drcrees avatar Aug 22 '25 00:08 drcrees

This issue has been automatically marked as stale because it has not had activity in the past 15 days.

It will be closed in 30 days if no further activity occurs. If this issue is still relevant, adding a comment will keep it open. Also, you can always reopen the issue if you missed the window.

Thank you for your contributions!

dd-octo-sts[bot] avatar Oct 10 '25 10:10 dd-octo-sts[bot]

Keeping this open, as others seem to be having the same issue.

imdevin567 avatar Oct 11 '25 01:10 imdevin567

Hi @imdevin567,

Thank you for taking the time to share your feedback. We recently introduced a system to help manage long-standing issues through automated stale management. We understand this may have come as a surprise for some existing issues. We want to apologize if it felt that way. It was not our intent, and we want to reassure you that every issue is important to us and reviewed carefully. For your specific issue, we’re reviewing it to ensure the discussion progresses and receives the right level of attention. Thank you for helping improve the Datadog Agent.

— The Datadog Agent Team

chouetz avatar Nov 12 '25 17:11 chouetz

For others encountering this issue, Datadog Support suggested setting the following environment variable on your cluster agent if you are using UDS. This is supplemental to @jess-belliveau's second bullet above:

env:
  - name: DD_ADMISSION_CONTROLLER_INJECT_CONFIG_TYPE_SOCKET_VOLUMES
    value: "true"

Additionally, client pods will need the admission.datadoghq.com/enabled: "true" label. This config will tell the admission controller to mount the socket files themselves, rather than just the /var/run/datadog directory.

-   - mountPath: /var/run/datadog
-     name: datadog
-     readOnly: true
+   - mountPath: /var/run/datadog/dsd.socket
+     name: datadog-dogstatsd
+     readOnly: true
+   - mountPath: /var/run/datadog/apm.socket
+     name: datadog-trace-agent
+     readOnly: true

drcrees avatar Nov 21 '25 15:11 drcrees