datadog-operator
Duplicate SLO created for each `DatadogSLO`
Describe what happened:
I've confirmed I'm only running a single Datadog Operator in my K8s cluster, but each DatadogSLO seems to create multiple SLOs in Datadog. I'm running version 1.3.0 of the operator. Creating an example DatadogSLO:
apiVersion: datadoghq.com/v1alpha1
kind: DatadogSLO
metadata:
  name: test-xyz
  namespace: test
spec:
  description: Error SLO for test-xyz
  name: Error SLO for test-xyz
  query:
    denominator: sum:trace.pyramid.request.hits{service:test-xyz, env:test}.as_count()
    numerator: sum:trace.pyramid.request.hits{service:test-xyz, env:test}.as_count() - sum:trace.pyramid.request.errors{service:test-xyz, env:test}.as_count()
  tags:
    - integration:kubernetes
    - service:test-xyz
    - env:test
    - team:sre
    - generated:kubernetes
  targetThreshold: 99500m
  timeframe: 7d
  type: metric
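(A note on the threshold value: targetThreshold appears to be parsed as a Kubernetes resource quantity; that's an assumption based on the milli-style value accepted here, under which 99500m is simply milli-notation for 99.5:)

# Equivalent ways to express a 99.5% target, assuming resource.Quantity parsing:
targetThreshold: 99500m   # 99500 milli-units = 99.5
# targetThreshold: "99.5" # decimal form; the API machinery canonicalizes
#                         # quantities, which can produce spurious diffs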
Applying this manifest results in multiple SLOs being created in Datadog. Deleting the DatadogSLO results in one of the SLOs being orphaned in Datadog.
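This is consistent with the operator tracking a single SLO ID in the resource status and cleaning up only that one on delete. A sketch of what kubectl get datadogslo test-xyz -n test -o yaml might report after reconciliation (field names assumed by analogy with DatadogMonitor's status; the ID is hypothetical):

status:
  id: "abc123def456ghi789jkl012"   # hypothetical Datadog SLO ID; only this one
                                   # is deleted when the CR is removed
  syncStatus: OK                   # assumed field, mirroring DatadogMonitor
  created: "2024-01-01T00:00:00Z"  # assumed field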
Describe what you expected:
I expect a single DatadogSLO resource to result in a single SLO created in Datadog.
Steps to reproduce the issue:
Install the Datadog Operator via Helm (chart version 1.4.1) with the following values:
datadogCRDs:
  crds:
    datadogSLOs: true
apiKeyExistingSecret: datadog-secret
appKeyExistingSecret: datadog-secret
datadogMonitor:
  enabled: true
datadogSLO:
  enabled: true
site: datadoghq.com
watchNamespaces:
  - ""
Kubectl apply the example DatadogSLO above.
Additional environment details (Operating System, Cloud provider, etc):
Hi, thanks for reporting this. We'll look into this on our end to see why multiple SLOs are getting created.
I've also seen this issue using the 1.8.3 Helm chart with the 1.7.0 operator.
Additionally, I was using Kyverno with a generate policy for DatadogSLOs and synchronization turned on. My target threshold was set to "99.0", but the datadog-operator controller would change it to "99", so Kyverno and the datadog-operator fought back and forth, each rewriting the value. The result was around 40 duplicate SLOs, as described in this issue. I mention all this because the problem seems to be exacerbated by updates to the resource.
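If the churn really is quantity canonicalization, writing the threshold in an already-canonical form should keep the two controllers from rewriting each other. A sketch under that assumption (the spec fragment is hypothetical):

# Hypothetical fragment for the Kyverno-generated DatadogSLO spec: use a form
# the API machinery won't rewrite, so synchronization has nothing to fight over.
spec:
  targetThreshold: "99"     # "99.0" is canonicalized to "99", retriggering sync
  # targetThreshold: 99500m # milli-notation for a fractional target such as 99.5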
Thanks for reporting the issue @paulbrassard-figure!
As mentioned here, the fix addressed one specific case leading to duplication, namely concurrent reconciliation of the resource. Since the SLO Create API is not idempotent, we can't guarantee that duplication won't happen. So it would be great if you could share more details about your setup and how to reproduce the issue with Kyverno, and if possible without it.