
cannot patch "datadogagents.datadoghq.com" with kind CustomResourceDefinition when updating operator via Helm & Terraform

Open GenPage opened this issue 3 years ago • 6 comments

Output of the info page (if this is a bug)

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.datadog-operator[0].helm_release.datadog-operator will be updated in-place
  ~ resource "helm_release" "datadog-operator" {
        id                         = "datadog"
        name                       = "datadog"
      + pass_credentials           = false
      ~ version                    = "0.7.10" -> "0.8.6"
        # (26 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

module.datadog-operator[0].helm_release.datadog-operator: Modifying... [id=datadog]
module.datadog-operator[0].helm_release.datadog-operator: Still modifying... [id=datadog, 10s elapsed]
╷
│ Error: cannot patch "datadogagents.datadoghq.com" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "datadogagents.datadoghq.com" is invalid: spec.validation: Forbidden: top-level and per-version schemas are mutually exclusive
│
│   with module.datadog-operator[0].helm_release.datadog-operator,
│   on modules/datadog-operator/main.tf line 62, in resource "helm_release" "datadog-operator":
│   62: resource "helm_release" "datadog-operator" {
│
╵
Releasing state lock. This may take a few moments...
ERRO[0131] 1 error occurred:
        * exit status 1
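For context, the "mutually exclusive" complaint is enforced by the apiserver: an apiextensions.k8s.io/v1beta1 CRD may carry either a single top-level schema under spec.validation or per-version schemas under spec.versions[*].schema, but never both. The Helm patch appears to merge new per-version schemas into a CRD that still has spec.validation, producing roughly this invalid shape (a heavily elided sketch, not the actual Datadog CRD):

```
# Invalid: both schema locations populated at once.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
spec:
  validation:                # top-level schema (old style)
    openAPIV3Schema:
      type: object
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:                # per-version schema (new style)
        openAPIV3Schema:
          type: object
```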

Describe what happened: Trying to update to the latest Helm chart to get newer agent versions

Describe what you expected: Operator to apply successfully

Steps to reproduce the issue: Upgrade from 0.7.10 -> 0.8.6

Additional environment details (Operating System, Cloud provider, etc): AWS EKS 1.20, Terraform 1.0.8

Agent CRD config (templated with Terraform)

---
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: ${namespace}
spec:
  agent:
    config:
      collectEvents: true
      leaderElection: true
      tolerations:
      - operator: Exists
      podLabelsAsTags: {
        "*": "kube_%%label%%"
      }
      tags:
      %{~ for tag in metric_tags ~}
        - ${tag}
      %{~ endfor ~}
    image:
      name: ${agent_image_tag}
    log:
      enabled: ${log_monitoring}
    process:
      enabled: true
      processCollectionEnabled: true
    systemProbe:
      enabled: true
    security:
      compliance:
        enabled: ${security_monitoring}
      runtime:
        enabled: ${security_monitoring}
  clusterAgent:
    enabled: true
    config:
      collectEvents: true
      clusterChecksEnabled: true
      externalMetrics:
        enabled: true
      volumeMounts:
        - mountPath: "/etc/datadog-agent/conf.d/mysql.d/"
          name: mysql-conf
          readOnly: true
        - mountPath: "/etc/datadog-agent/conf.d/postgres.d/"
          name: postgres-conf
          readOnly: true
      volumes:
        - name: mysql-conf
          projected:
            sources:
              - secret:
                  name: mysql-${db}-conf
        - name: postgres-conf
          projected:
            sources:
              - secret:
                  name: pgsql-${db}-conf
    image:
      name: ${cluster_agent_image_tag}
  clusterName: ${cluster_name}
  credentials:
    apiSecret:
      secretName: ${cred_secret_name}
      keyName: api-key
    appSecret:
      secretName: ${cred_secret_name}
      keyName: app-key
  features:
    kubeStateMetricsCore:
      enabled: true
    networkMonitoring:
      enabled: ${network_monitoring}
    orchestratorExplorer:
      enabled: true
      extraTags:
        - "datacenter:${datacenter}"
    prometheusScrape:
      enabled: false
  registry: public.ecr.aws/datadog

Datadog Operator config:

---
apiKeyExistingSecret: ${secret_name}
appKeyExistingSecret: ${secret_name}
datadogMonitor:
  enabled: ${enable_datadog_monitor}
supportExtendedDaemonset: ${enable_extended_daemonset}
registry: public.ecr.aws/datadog
watchNamespaces:
  - ""

GenPage avatar Aug 24 '22 15:08 GenPage

I noticed the new CRD for DatadogAgent v2alpha1 but I wouldn't expect it to error at the operator level unless there was an issue with the CRD spec itself.

GenPage avatar Aug 24 '22 17:08 GenPage

Update: I was able to update to 0.8.1 successfully. However, on any update past that, I get a new error when deploying the DatadogAgent CRD:

Attribute not found in schema

  with module.datadog-agent[0].kubernetes_manifest.datadog-agent-operator,
  on modules/datadog-agent/main.tf line 18, in resource "kubernetes_manifest" "datadog-agent-operator":
  18: resource "kubernetes_manifest" "datadog-agent-operator" {

Unable to find schema type for attribute:
spec.clusterAgent.config.volumes[0].ephemeral.readOnly

╷
│ Error: Failed to transform Tuple element into Tuple element type
│
│   with module.datadog-agent[0].kubernetes_manifest.datadog-agent-operator,
│   on modules/datadog-agent/main.tf line 18, in resource "kubernetes_manifest" "datadog-agent-operator":
│   18: resource "kubernetes_manifest" "datadog-agent-operator" {
│
│ Error (see above) at attribute:
│ spec.clusterAgent.config.volumes[0]
╵
╷
│ Error: Failed to transform Object element into Object element type
│
│   with module.datadog-agent[0].kubernetes_manifest.datadog-agent-operator,
│   on modules/datadog-agent/main.tf line 18, in resource "kubernetes_manifest" "datadog-agent-operator":
│   18: resource "kubernetes_manifest" "datadog-agent-operator" {
│
│ Error (see above) at attribute:
│ spec.clusterAgent.config.volumes
╵
╷
│ Error: Failed to transform Object element into Object element type
│
│   with module.datadog-agent[0].kubernetes_manifest.datadog-agent-operator,
│   on modules/datadog-agent/main.tf line 18, in resource "kubernetes_manifest" "datadog-agent-operator":
│   18: resource "kubernetes_manifest" "datadog-agent-operator" {
│
│ Error (see above) at attribute:
│ spec.clusterAgent.config
╵
╷
│ Error: Failed to transform Object element into Object element type
│
│   with module.datadog-agent[0].kubernetes_manifest.datadog-agent-operator,
│   on modules/datadog-agent/main.tf line 18, in resource "kubernetes_manifest" "datadog-agent-operator":
│   18: resource "kubernetes_manifest" "datadog-agent-operator" {
│
│ Error (see above) at attribute:
│ spec.clusterAgent
╵
╷
│ Error: Failed to transform Object element into Object element type
│
│   with module.datadog-agent[0].kubernetes_manifest.datadog-agent-operator,
│   on modules/datadog-agent/main.tf line 18, in resource "kubernetes_manifest" "datadog-agent-operator":
│   18: resource "kubernetes_manifest" "datadog-agent-operator" {
│
│ Error (see above) at attribute:
│ spec
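The cascade of "Failed to transform" errors is the kubernetes_manifest provider walking the manifest against the OpenAPI schema published by the CRD installed in the cluster: the innermost attribute (ephemeral.readOnly) has no entry in the old v1alpha1 schema, and the failure propagates outward through volumes[0], volumes, config, clusterAgent, and spec. A toy sketch of that lookup (illustrative only, not the provider's actual code; the dict shapes here are made up):

```python
# Toy illustration of how one unknown manifest attribute fails schema
# lookup, which is what kubernetes_manifest is reporting above.
def find_unknown(manifest, schema, path="spec"):
    """Return dotted paths in `manifest` that have no schema entry."""
    unknown = []
    for key, value in manifest.items():
        sub = schema.get(key)
        here = f"{path}.{key}"
        if sub is None:
            unknown.append(here)
        elif isinstance(value, dict):
            unknown.extend(find_unknown(value, sub, here))
    return unknown

# The old v1alpha1 CRD schema knows `name` and `projected`,
# but has no entry for an `ephemeral` volume source:
schema = {"volumes": {"name": {}, "projected": {}}}
manifest = {"volumes": {"name": "mysql-conf", "ephemeral": {"readOnly": True}}}
print(find_unknown(manifest, schema))  # ['spec.volumes.ephemeral']
```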

GenPage avatar Aug 25 '22 04:08 GenPage

Okay, I tried using the new schema to redeploy the agent, but it either isn't ready or I'm just misunderstanding exactly what the upgrade path looks like.

The operator at helm-chart version 0.8.6 does not install the v2alpha1 CRD. Trying to use the v2alpha1 spec while marking it as v1alpha1 also results in an error.

[~]$ kubectl api-resources | grep datadog
datadogagents                             dd            datadoghq.com/v1alpha1                 true         DatadogAgent
datadogmetrics                                          datadoghq.com/v1alpha1                 true         DatadogMetric
datadogmonitors                                         datadoghq.com/v1alpha1                 true         DatadogMonitor

GenPage avatar Aug 29 '22 15:08 GenPage

It appears the CRD at datadog-operator/bundle/manifests/datadoghq.com_datadogagents.yaml is not updated to the latest CRD from the helm-chart repo, though I'm not sure if it's the source of truth for what gets bundled.

It is odd, though: the original error reported in this issue seems to indicate that the new CRD spec is being patched in. I'm just operating with limited information based on the errors presented to me.

I was able to completely unblock myself by upgrading the operator to 0.8.1 first, deleting the DatadogAgent deployment, and THEN redeploying the v1alpha1 spec after upgrading to 0.8.6.

GenPage avatar Aug 29 '22 16:08 GenPage

also curious about this

zcicala avatar Aug 29 '22 17:08 zcicala

To clarify, this was tested on another cluster starting from 0.7.10. After more thorough testing, I wasn't able to upgrade straight to 0.8.6 even after deleting the datadogagents resource. I stepped down the target version until the upgrade to 0.8.3 succeeded, which then allowed me to apply 0.8.6.

The 0.8.4 and 0.8.5 error differs from the straight-to-0.8.6 error:

│ Error: template: datadog-operator/charts/datadog-crds/templates/datadoghq.com_datadogmonitors_v1beta1.yaml:1:41: executing "datadog-operator/charts/datadog-crds/templates/datadoghq.com_datadogmonitors_v1beta1.yaml" at <semverCompare "<=21" .Capabilities.KubeVersion.Minor>: error calling semverCompare: Invalid Semantic Version
│
│   with module.datadog-operator[0].helm_release.datadog-operator,
│   on modules/datadog-operator/main.tf line 62, in resource "helm_release" "datadog-operator":
│   62: resource "helm_release" "datadog-operator" {
│
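This semverCompare failure is consistent with EKS reporting .Capabilities.KubeVersion.Minor with a trailing "+" (for example "20+"), which is not a parseable version string. A rough sketch of why (the regex below only approximates the lenient SemVer grammar that Helm's semverCompare accepts; it is not the actual implementation):

```python
import re

# Approximation of a lenient SemVer grammar: optional minor/patch,
# optional pre-release, optional NON-EMPTY build metadata after "+".
SEMVER = re.compile(
    r"^v?(\d+)(\.\d+)?(\.\d+)?"
    r"(-[0-9A-Za-z.-]+)?"
    r"(\+[0-9A-Za-z.-]+)?$"
)

def is_valid(version: str) -> bool:
    return SEMVER.match(version) is not None

# EKS exposes the minor version with a trailing "+" (e.g. "20+"),
# which leaves the build-metadata part empty and fails to parse.
print(is_valid("20"))    # a plain minor version parses fine
print(is_valid("20+"))   # the trailing "+" makes it invalid
```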

GenPage avatar Aug 29 '22 18:08 GenPage

@GenPage So when I tried to upgrade from 0.8.0 to 0.8.6, I got this error:

Helm install failed: template: datadog-operator/charts/datadog-crds/templates/datadoghq.com_datadogmonitors_v1beta1.yaml:1:41: executing "datadog-operator/charts/datadog-crds/templates/datadoghq.com_datadogmonitors_v1beta1.yaml" at <semverCompare "<21" .Capabilities.KubeVersion.Minor>: error calling semverCompare: Invalid Semantic Version

So if I understand correctly, upgrading from 0.8.0 to 0.8.3 should be fine, and once that's done I can upgrade to 0.8.6.

Is that the upgrade path I should try? Also, do I still need to delete anything separately?

rajivchirania avatar Jan 23 '23 08:01 rajivchirania

Yes, that's how I was able to get Helm to apply successfully. I did not delete anything.

GenPage avatar Jan 23 '23 21:01 GenPage

Hello, sorry for not getting to this issue in time. Operator v0.x is no longer supported, and we recommend migrating to the most recent version of v1.x.

Please open a new issue if there is anything blocking migration or if you experience the same issue in v1.x.

levan-m avatar Oct 16 '23 21:10 levan-m