
Monitors modified out-of-band are not reconciled

mf-lit opened this issue 2 years ago • 4 comments

Describe what happened:

I've successfully created a Monitor with the following manifest:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMonitor
metadata:
  name: hello-world-web
  namespace: hello-world
spec:
  query: "avg(last_5m):avg:kubernetes_state.hpa.desired_replicas{hpa:hello-world-web,kube_namespace:hello-world} / avg:kubernetes_state.hpa.max_replicas{hpa:hello-world-web,kube_namespace:hello-world} >= 1"
  type: "query alert"
  name: "HPA TEST"
  message: "Number of replicas for HPA hello-world-web in namespace hello-world, has reached maximum threshold @target1"
  tags:
    - "service:hello-world"

I then used the Datadog console to edit that monitor, modifying the message to change @target1 to @target2.

Describe what you expected:

I expected the controller to reconcile the monitor, changing @target2 back to @target1. This never happened, even though the resource appears to be continually synced successfully:

kubectl describe datadogmonitor hello-world-web --namespace hello-world
Name:         hello-world-web
Namespace:    hello-world
Labels:       <none>
Annotations:  <none>
API Version:  datadoghq.com/v1alpha1
Kind:         DatadogMonitor
Metadata:
  Creation Timestamp:  2021-12-15T12:53:41Z
  Finalizers:
    finalizer.monitor.datadoghq.com
  Generation:  3
  Managed Fields:
    API Version:  datadoghq.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:message:
        f:name:
        f:query:
        f:type:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2021-12-15T12:53:41Z
    API Version:  datadoghq.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizer.monitor.datadoghq.com":
      f:spec:
        f:options:
        f:tags:
      f:status:
        .:
        f:conditions:
          .:
          k:{"type":"Active"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:message:
            f:status:
            f:type:
          k:{"type":"Created"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:message:
            f:status:
            f:type:
        f:created:
        f:creator:
        f:currentHash:
        f:downtimeStatus:
        f:id:
        f:monitorState:
        f:monitorStateLastTransitionTime:
        f:monitorStateLastUpdateTime:
        f:primary:
        f:syncStatus:
    Manager:         manager
    Operation:       Update
    Time:            2021-12-15T12:53:41Z
  Resource Version:  91886916
  UID:               91c8a22f-818d-40ee-a014-dcf9f8c84a52
Spec:
  Message:  Number of replicas for HPA hello-world-web in namespace hello-world, has reached maximum threshold @slack-team-example
  Name:     [kubernetes] MARCTEST hello-world-web HPA reached max replicas
  Options:
  Query:  avg(last_5m):avg:kubernetes_state.hpa.desired_replicas{hpa:hello-world-web,kube_namespace:hello-world} / avg:kubernetes_state.hpa.max_replicas{hpa:hello-world-web,kube_namespace:hello-world} >= 1
  Tags:
    service:hello-world
    generated:kubernetes
  Type:  query alert
Status:
  Conditions:
    Last Transition Time:  2021-12-15T12:53:41Z
    Last Update Time:      2021-12-15T13:17:41Z
    Message:               DatadogMonitor ready
    Status:                True
    Type:                  Active
    Last Transition Time:  2021-12-15T12:53:41Z
    Last Update Time:      2021-12-15T12:53:41Z
    Message:               DatadogMonitor Created
    Status:                True
    Type:                  Created
  Created:                 2021-12-15T12:53:41Z
  Creator:                 redacted
  Current Hash:            a4ed04577b43c8b209ec6c3bb489b179
  Downtime Status:
  Id:                                  58175053
  Monitor State:                       OK
  Monitor State Last Transition Time:  2021-12-15T12:55:41Z
  Monitor State Last Update Time:      2021-12-15T13:17:41Z
  Primary:                             true
  Sync Status:                         OK
Events:
  Type    Reason                 Age   From            Message
  ----    ------                 ----  ----            -------
  Normal  Create DatadogMonitor  24m   DatadogMonitor  hello-world/hello-world-web

A brief look at the code suggests that a hash of the spec is being compared, so I'm surprised the operator isn't picking this up.
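For context, here is a minimal sketch of what a hash-of-spec comparison looks like (all type and function names here are hypothetical; the real operator hashes the full DatadogMonitorSpec). The key property is that both the spec and the stored hash live on the Kubernetes side, so an edit made only in the Datadog UI changes neither, and nothing ever looks out of date:

```go
package main

import (
	"crypto/md5"
	"encoding/json"
	"fmt"
)

// monitorSpec is a hypothetical, trimmed-down stand-in for the CRD spec.
type monitorSpec struct {
	Query   string `json:"query"`
	Type    string `json:"type"`
	Name    string `json:"name"`
	Message string `json:"message"`
}

// hashSpec serializes the spec and hashes it, the way a
// hash-of-spec change check would.
func hashSpec(s monitorSpec) string {
	b, _ := json.Marshal(s)
	return fmt.Sprintf("%x", md5.Sum(b))
}

func main() {
	spec := monitorSpec{Type: "query alert", Name: "HPA TEST", Message: "... @target1"}
	currentHash := hashSpec(spec) // recorded in status at the last sync

	// An edit made in the Datadog UI changes the remote monitor only;
	// the k8s spec, and therefore its hash, is untouched.
	needsUpdate := hashSpec(spec) != currentHash
	fmt.Println(needsUpdate) // false: the drift is invisible to this check
}
```

Under this scheme, only a change to the spec itself can make the two hashes differ, which matches the behavior reported in this issue.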

mf-lit avatar Dec 15 '21 13:12 mf-lit

Hi @mf-lit , apologies for the delay. I just tried reproducing the issue using your example DatadogMonitor and I wasn't able to; changing @target1 to @target2 and back several times worked as expected. If you're still experiencing the issue can you check the controller logs for anything unusual? Thanks.

celenechang avatar Jan 19 '22 13:01 celenechang

Hi @celenechang sorry it's taken so long to get back to you, it seems GitHub has given up on sending me notifications.

I've just tried this again: I applied the exact manifest above, then went to the monitoring dashboard and edited the new monitor, changing @target1 to @target2. I then waited for the "Last Update Time" on the datadogmonitor resource to update, but the monitor remains on @target2 while the k8s resource still has @target1.

There's nothing remarkable in the operator log either I'm afraid. Just:

{"level":"INFO","ts":"2022-02-04T14:33:01Z","logger":"controllers.DatadogMonitor","msg":"Reconciling DatadogMonitor","datadogmonitor":"hello-world/hello-world-web"}
{"level":"INFO","ts":"2022-02-04T14:33:01Z","logger":"controllers.DatadogMonitor","msg":"Reconciling DatadogMonitor","datadogmonitor":"hello-world/hello-world-web"}
{"level":"INFO","ts":"2022-02-04T14:33:06Z","logger":"controllers.DatadogMonitor","msg":"Reconciling DatadogMonitor","datadogmonitor":"hello-world/hello-world-web"}

It's not clear why there are three of these logs in such a short space of time.

mf-lit avatar Feb 04 '22 14:02 mf-lit

AFAIK the update is only triggered if the hashed spec of the monitor differs from the hash stored in status. Nowhere during the reconciliation loop do I see the monitor fetched from DD being compared to the k8s state.
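To illustrate what the comment above says is missing, here is a hedged sketch of the comparison a drift-aware reconcile would need to perform (hypothetical types; the real Datadog API returns a much richer monitor object). Rather than comparing the spec against its own previous hash, it diffs the remote monitor against the spec:

```go
package main

import "fmt"

// remoteMonitor stands in for the monitor as returned by the
// Datadog API (hypothetical, trimmed-down shape).
type remoteMonitor struct {
	Query   string
	Message string
}

// desired mirrors the DatadogMonitor spec fields we care about.
type desired struct {
	Query   string
	Message string
}

// driftDetected compares the remote monitor against the k8s spec.
// Per this thread, the operator did not do this comparison at the
// time, which is why UI edits were never reverted.
func driftDetected(d desired, r remoteMonitor) bool {
	return d.Query != r.Query || d.Message != r.Message
}

func main() {
	d := desired{Query: "avg(last_5m):...", Message: "... @target1"}
	r := remoteMonitor{Query: "avg(last_5m):...", Message: "... @target2"} // edited in the UI

	if driftDetected(d, r) {
		fmt.Println("remote drifted; push the spec back to Datadog")
	}
}
```

The trade-off is an extra API read per reconcile, which is presumably why the original implementation relied on the cheaper hash check.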

ymatsiuk avatar May 16 '22 10:05 ymatsiuk

I'm using version:

    app.kubernetes.io/version: 0.8.1
    helm.sh/chart: datadog-operator-0.8.5

I'm seeing the same thing; I was going to open an issue asking whether this is the desired behavior or not.

This is what I did: I changed "Delay monitor evaluation by x seconds" in the UI. The operator never changed it back. Even with debug logging, this is all I see:

{"level":"DEBUG","ts":"2022-07-25T15:07:32Z","logger":"controllers.DatadogMonitor","msg":"Synced DatadogMonitor state","datadogmonitor":"test/test-dd-monitor-crd","Monitor ID":77955819,"Monitor Name":"test-dd-monitor-crd","Monitor Namespace":"test"}

More problematic: I tried reapplying the DatadogMonitor resource unchanged (to see if that would force it to reapply settings) and it still didn't change it! I had to modify a field of the monitor config, reapply, and only then did it change it back.

I also tried just modifying the resource's metadata.labels to see if that would trigger a sync of the config, but it did not. I even tried modifying spec.labels, but that also did nothing.

I needed to change the actual monitor config and apply for it to sync.
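The label behavior described above is consistent with the hash covering only the monitor config: if metadata is excluded from the hashed input, editing labels can never change the hash and so never triggers an update. A small sketch of that idea (hypothetical, simplified types; not the operator's actual code):

```go
package main

import (
	"crypto/md5"
	"encoding/json"
	"fmt"
)

// resource is a hypothetical, trimmed-down CRD: metadata labels plus spec.
type resource struct {
	Labels map[string]string // metadata.labels
	Spec   map[string]string // monitor config
}

// specHash hashes only the spec, so metadata-only edits can never
// change the stored hash and therefore never trigger an update.
func specHash(r resource) string {
	b, _ := json.Marshal(r.Spec) // map keys are marshaled in sorted order
	return fmt.Sprintf("%x", md5.Sum(b))
}

func main() {
	r := resource{
		Labels: map[string]string{"team": "a"},
		Spec:   map[string]string{"message": "@target1"},
	}
	before := specHash(r)

	r.Labels["team"] = "b" // label edit: hash unchanged, no sync
	fmt.Println(specHash(r) == before)

	r.Spec["message"] = "@target2" // spec edit: hash changes, update pushed
	fmt.Println(specHash(r) == before)
}
```

This would explain why only an actual change to the monitor config forced a resync.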

red8888 avatar Jul 25 '22 15:07 red8888

@celenechang are there any updates on this? This has prevented me from using the monitor CRD.

The monitor resource should be asserting the desired state during the reconciliation loop, right?

If users can modify it in the UI and the operator won't revert those changes, I can't use it. Even if I'm locking monitors, it opens up a huge troubleshooting nightmare if I can't be 100% confident the desired state is being asserted continuously.

red8888 avatar Jan 10 '23 16:01 red8888

I'm also experiencing this. I was also assuming the operator would reapply the monitor as specified in kubernetes if a change was made in the UI.

This comment seems spot on; this behavior doesn't seem to be coded.

tylermichael avatar Jan 31 '23 16:01 tylermichael

Same here. I created a monitor using the operator and then removed it from the UI console, but the reconcile never happened. I also tried to force a recreate using kubectl get -o json datadogmonitor <NAME> | kubectl replace -f -, but nothing changed. Is this the expected behavior?

willianccs avatar Mar 07 '23 11:03 willianccs

Apologies for the delay in updating the issue. This should be fixed by https://github.com/DataDog/datadog-operator/pull/791

celenechang avatar Jul 10 '23 15:07 celenechang