
User-created secrets must not use the same names as operator-created secrets

Open BenB196 opened this issue 3 years ago • 5 comments

Bug Report

What did you do?

  1. Create a standalone Elastic Agent deployment per the docs.

1.1. Create secret:

apiVersion: v1
kind: Secret
metadata:
  name: system-cpu-config
stringData:
  agent.yml: |-
    inputs:
      - name: system-1
        revision: 1
        type: system/metrics
        use_output: default
        meta:
          package:
            name: system
            version: 0.9.1
        data_stream:
          namespace: default
        streams:
          - id: system/metrics-system.cpu
            data_stream:
              dataset: system.cpu
              type: metrics
            metricsets:
              - cpu
            cpu.metrics:
              - percentages
              - normalized_percentages
            period: 10s

1.2. Create agent:

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: quickstart
spec:
  version: 8.0.0
  elasticsearchRefs:
  - name: quickstart
  daemonSet: {}
  configRef:
    secretName: system-cpu-config

  2. Delete the agent.

What did you expect to see?

I expected to see the created secret remaining in the cluster.

What did you see instead? Under which circumstances?

I see that the secret is missing from the cluster.

Environment

  • ECK version:

    1.9.1

Note: I have not tested this on 2.0, but I did check the release notes and didn't see anything that mentions this, so I'm assuming this issue also happens there.

  • Kubernetes information:

    • On premise: yes
    • Cloud (GKE / EKS / AKS): no
    • Kubernetes distribution (Openshift / Rancher / PKS): v1.21.8+rke2r2

Issue/Problem

The problem is that in a configuration where the user provides the secret, I would expect the user to have total control over it; as it currently stands, however, the ECK operator also assumes control over the secret.

Use cases where this can be an issue:

  1. Manual deletion of the CRD for a standalone agent. If a user deletes a standalone Elastic Agent CRD for some reason, they may not expect the secret with the config to be deleted as well. If they then re-add the Agent CRD, it won't work, because the operator will have deleted the config secret. This can be a real problem: a user developing a new standalone config in a cluster could lose work if they don't have it saved somewhere else.
  2. Management of CRDs via a tool such as Terraform. If you use Terraform to manage your Kubernetes deployments/clusters and you make the Elastic Agent manifest depend on the config secret (so the secret is created before the Agent manifest), then running something like terraform destroy will fail: Terraform deletes the Agent before the secret, and since the operator has already deleted the secret, Terraform can't delete it.

BenB196 avatar Feb 13 '22 19:02 BenB196

I have not been able to reproduce this issue. Are you sure there was not something else at play here? Also I wonder by what mechanism the deletion could possibly happen. We do not set an ownerReference on the secret. The only other mechanism to "clean up" secrets we have is a label based garbage collection that we apply on credentials/certificate secrets the operator creates itself. This mechanism uses the following labels:

https://github.com/elastic/cloud-on-k8s/blob/9be005d2a4b6e26f98ff9c520ea53ae4d6cc9455/pkg/controller/common/reconciler/secret.go#L25-L27

So unless you added those labels to your config secret I struggle to see how it could be deleted.
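
For illustration, a secret that the operator itself created and marks for that label-based cleanup carries labels along these lines (a sketch; the exact label keys are defined in the linked source, and the name here is hypothetical):

```yaml
# Sketch: a secret the operator's label-based garbage collection would pick up.
# A plain user-created secret carries none of these labels.
apiVersion: v1
kind: Secret
metadata:
  name: example-operator-managed-secret     # hypothetical name
  labels:
    eck.k8s.elastic.co/credentials: "true"  # cleanup marker applied by the operator
stringData:
  token: example
```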

pebrc avatar Feb 14 '22 10:02 pebrc

Hi @pebrc, did some further testing.

Here is what the secret looks like once applied:

apiVersion: v1
data:
  agent.yml: <snipped>
kind: Secret
metadata:
  creationTimestamp: "2022-02-14T19:53:52Z"
  labels:
    agent.k8s.elastic.co/name: elastic-agent-qa-6abf7ea7
    common.k8s.elastic.co/type: agent
    eck.k8s.elastic.co/credentials: "true"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data: {}
      f:type: {}
    manager: HashiCorp
    operation: Update
    time: "2022-02-14T19:53:52Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        f:agent.yml: {}
      f:metadata:
        f:labels:
          .: {}
          f:agent.k8s.elastic.co/name: {}
          f:common.k8s.elastic.co/type: {}
          f:eck.k8s.elastic.co/credentials: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"a69c7d88-90c8-4e43-81d7-7e6004f4be9a"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
    manager: elastic-operator
    operation: Update
    time: "2022-02-14T19:54:05Z"
  name: elastic-agent-qa-6abf7ea7-agent-config
  namespace: elastic-agents-dev
  ownerReferences:
  - apiVersion: agent.k8s.elastic.co/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Agent
    name: elastic-agent-qa-6abf7ea7
    uid: a69c7d88-90c8-4e43-81d7-7e6004f4be9a
  resourceVersion: "107594683"
  uid: 1025f998-08bd-4c63-8ed7-4ab6b0c45dc1
type: Opaque

After deleting the agent CRD, the secret was removed from the cluster.

I don't see any logs on the ECK operator about it deleting stuff (even with debug --log-verbosity=1), so I'm not sure how else to go about validating the issue.

I will be upgrading on the next 2.x.y release (whatever that will be) and can revalidate against that if wanted as well.

BenB196 avatar Feb 14 '22 20:02 BenB196

I think I misunderstood your initial bug report. I thought you were saying that the secret created and owned by the user (system-cpu-config in your example in the OP) had been removed.

After your last post I think you mean that the secret the operator creates based on the user's secret ($AGENT_NAME-agent-config) is being removed. The latter is completely expected and, I would say, correct.

The flow is as follows:

  1. the operator reads the user-provided secret (system-cpu-config) and merges its content with configuration it manages itself, such as connection information for Elasticsearch
  2. the resulting configuration is stored in a new secret following the naming scheme $AGENT_NAME-agent-config. This secret is fully managed by the ECK operator and will be deleted once the Agent has been deleted
  3. the config secret is mounted into the Agent Pods, and the Agent is configured to use the configuration file it contains.
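
For the quickstart example from the OP, this flow would produce a managed secret roughly like the following (a sketch; the labels, ownerReference, and merged content are illustrative, based on the secret dump earlier in this thread):

```yaml
# Sketch of the operator-managed secret derived from the user's
# system-cpu-config for an Agent named "quickstart".
apiVersion: v1
kind: Secret
metadata:
  name: quickstart-agent-config            # $AGENT_NAME-agent-config
  labels:
    eck.k8s.elastic.co/credentials: "true" # marks it for operator cleanup
  ownerReferences:                         # ties its lifetime to the Agent
  - apiVersion: agent.k8s.elastic.co/v1alpha1
    kind: Agent
    name: quickstart
    controller: true
    blockOwnerDeletion: true
stringData:
  agent.yml: |-
    # user config from system-cpu-config, merged with operator-managed
    # settings such as the Elasticsearch connection information
```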

Neither of the issues you mention above applies here:

  • you do not lose any configuration work, as the original secret is not touched; everything in $AGENT_NAME-agent-config is derivative information and can be reconstructed from the original source.
  • tools like Terraform should not be affected: the $AGENT_NAME-agent-config secret should not be managed by Terraform, and the original secret system-cpu-config is still there and can be managed by Terraform.

pebrc avatar Feb 14 '22 21:02 pebrc

Thanks @pebrc for the detailed explanation; I think I see the issue now. My OP wasn't a 100% replica of my real-world setup (apologies for the copy-paste error), and I think this caused some confusion. In my real-world setup, my secret is not named system-cpu-config but ${var.ELASTIC_AGENT_NAME}-agent-config, provided via a Terraform variable/resource. Based on your explanation, here is what seems to be happening:

  1. I create my secret that has the name ${var.ELASTIC_AGENT_NAME}-agent-config
  2. The ECK operator reads this secret.
  3. The ECK operator generates a new secret with the name $AGENT_NAME-agent-config (which matches my secret ${var.ELASTIC_AGENT_NAME}-agent-config)
  4. The ECK operator overwrites my secret with its own, since both secrets have the same name.
  5. I delete the Agent CRD
  6. The ECK operator deletes "its" secret
  7. My secret is deleted along with it, because "my" secret had become the ECK operator's secret.
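
The collision described in these steps can be reproduced with manifests like the following (a sketch with a hypothetical agent name; the point is only that the user's secret name matches the operator's $AGENT_NAME-agent-config scheme):

```yaml
# User-created secret whose name happens to match the operator's
# naming scheme for the Agent below ("quickstart" + "-agent-config").
apiVersion: v1
kind: Secret
metadata:
  name: quickstart-agent-config
stringData:
  agent.yml: |-
    # ... user configuration ...
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: quickstart        # operator derives "quickstart-agent-config"
spec:
  version: 8.0.0
  elasticsearchRefs:
  - name: quickstart
  daemonSet: {}
  configRef:
    secretName: quickstart-agent-config  # same secret the operator will adopt and later delete
```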

So, knowing this now, I'd say there is still some "bug" here, though it may not be the original "issue".

I'd say one (or more) of the following should be considered:

  1. Instead of overwriting a non-managed secret, the ECK operator should error out, reporting that the secret names conflict.
  2. The docs should state that users should avoid creating a secret named $AGENT_NAME-agent-config, as the ECK operator will overwrite it.
  3. The ECK operator should "trust" the user-provided secret with this name, and not overwrite or try to manage it.

(There may be other routes as well that I'm not aware of/thinking of)

BenB196 avatar Feb 15 '22 00:02 BenB196

I think I am leaning toward your option 2.

Option 1 is difficult because we have multiple conflicting cases here:

  • the secret in the API Server is different from the one the operator is about to write because it was created by an earlier version of the operator and we have introduced a change
  • the secret in the API Server is different because a user (such as you) has created a secret with the same name the operator uses

Option 3 is difficult because we have to merge additional configuration into the secret to run Elastic Agent successfully.

pebrc avatar Feb 15 '22 09:02 pebrc