etcd-druid Druid Refactor to Address Multiple Controller Conflicts

How to categorize this PR?

/area control-plane dev-productivity quality /kind api-change enhancement cleanup test

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #728
Fixes #759
Changes made as part of https://github.com/gardener/etcd-druid/pull/772 have also been included in this PR.

Release note:

Custodian controller has now been removed in favour of etcd status reconciliation handled by etcd controller. CLI flags `--custodian-workers` and `--custodian-sync-period` have now been removed, and are no longer recognised by etcd-druid.

CLI flag `--workers` has now been renamed to `--etcd-workers`. Additionally, etcd controller also accepts new CLI flags `enable-etcd-spec-auto-reconcile` to control how and when the etcd spec is reconciled, and `etcd-status-sync-period` to specify the duration after which an event will be re-queued to ensure etcd status reconciliation. CLI flag `ignore-operation-annotation` has been deprecated, and will be removed in an upcoming release.

CLI flag `--metrics-addr` is now deprecated. Please use `--metrics-bind-address` and `--metrics-port` instead.

A new validating webhook named `sentinel` has been introduced to safeguard resources created by etcd-druid. A new annotation `druid.gardener.cloud/disable-resource-protection` has been introduced, which if set, tells sentinel webhook to allow manual changes by an operator on any resource managed by etcd-druid. This webhook can be enabled by passing CLI flag `--enable-sentinel-webhook` to the etcd-druid binary. Additionally, CLI flags `--webhook-server-bind-address`, `--webhook-server-port` and `--webhook-server-tls-server-cert-dir` need to be passed when enabling the webhook, which enforces TLS communication using the given certs.

Annotation `druid.gardener.cloud/ignore-reconciliation` has been marked as deprecated. Please use `druid.gardener.cloud/suspend-etcd-spec-reconcile` instead, which provides the same behavior.

All packages under `/pkg` and `/controllers` directories have now been moved to new parent `/internal` directory.

The component model used for deploying resources has now been replaced with a simplified `ResourceOperator` model, found under `/internal/operator`.

Etcd resource status now includes field `LastOperation` to indicate the last operation performed on the etcd resource. This includes a unique `RunID` to help sift through logs containing the specific `RunID`, improving debuggability. Every reconciler run generates a unique `RunID`.

Etcd resource status now includes field `LastErrors` to indicate any errors encountered in the last reconciliation of the etcd resource. Custom error codes have been introduced to help capture contextual information from the reconciliation run.

Labels on druid-managed resources are now streamlined, and no longer include `name` and `instance`. Instead, these are now standard labels `app.kubernetes.io/managed-by` and `app.kubernetes.io/part-of`, as [recommended by Kubernetes](https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/#labels). Additionally, `app.kubernetes.io/component` label is also used to set the type of the component for an etcd cluster.

Scale-up logic for single-node etcd clusters with peerTLS disabled to multi-node etcd clusters with peerTLS enabled, has been improved by making it deterministic and eliminates an unnecessary restart of the first etcd member, thus making this process faster and error-free.

A new condition `DataVolumesReady` has been introduced in `etcd.Status` to capture and report PVC warnings.

Etcd pods now mount files with `DefaultMode` set to `0640`.

etcd-backup-restore container was started with SYS_PTRACE linux capability. This prevented creating etcd cluster with [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/). This linux capability has now been removed as it is no longer required.

Volume mounts for the etcd StatefulSet have now been fixed, to allow individually specifying TLS secrets for the etcd and backup-restore servers. CA and TLS certificates used for etcd client-server communication, relevant to the container that they are mounted on, can be found at `/var/etcd/ssl/`. CA and TLS certificates used for etcd peer communication, relevant to the container that they are mounted on, can be found at `/var/etcd/ssl/peer`. CA and TLS certificates used for etcd-backup-restore client-server communication, relevant to the container that they are mounted on, can be found at `/var/etcdbr/ssl`.

The backup secret mounted onto the backup-restore container in the etcd pod, used by etcd-backup-restore process, is now mounted onto `/var/backup-store` instead of `/var/etcd-backup`.

We are moving towards using golang native tests. This also allowed us to relook at the unit and integration tests that we have. In this PR we have only partially introduced comprehensive golang native tests for specific packages (`internal/operator`, `internal/webhook`, `internal/controller/etcd/` and `internal/utils/`). We have also added comprehensive integration tests for etcd controller and the new IT tests are present at `test/it/controller/etcd`. In future PRs we will replace the ginkgo based tests and replace it with native golang tests for rest of the packages as well.

Apr 02 '24 08:04 unmarshall

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

GitGuardian id	Secret	Commit	Filename
-	RSA Private Key	e2bfea3ca14e1583a0a5785c330c45df68f9e3a5	charts/druid/resources/server.key	View secret
-	RSA Private Key	e2bfea3ca14e1583a0a5785c330c45df68f9e3a5	charts/druid/resources/ca.key	View secret
-	RSA Private Key	e2bfea3ca14e1583a0a5785c330c45df68f9e3a5	charts/druid/templates/secret-server-tls-crt.yaml	View secret

🛠 Guidelines to remediate hardcoded secrets

Understand the implications of revoking this secret by investigating where it is used in your code.
Replace and store your secrets safely. Learn here the best practices.
Revoke and rotate these secrets.
If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider

following these best practices for managing and storing secrets including API keys and other credentials
install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.

^{🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.}

Apr 02 '24 08:04 gitguardian[bot]

make fmt passes

make check passes:

 Generating CRDs for druid.gardener.cloud group
 > Check
 Executing golangci-lint
 Executing gofmt/goimports
 All checks successful

make test passes: druid-pr-777-7d91607-unit.log
make test-integration passes: druid-pr-777-7d91607-int.log
make test-e2e passes: druid-pr-777-7d91607-e2e.log

Apr 05 '24 14:04 shreyas-s-rao

@unmarshall You need rebase this pull request with latest master branch. Please check.

Apr 16 '24 09:04 gardener-robot

/retest

Apr 22 '24 09:04 shreyas-s-rao

/assign

Apr 25 '24 05:04 seshachalam-yv

/assign

Apr 30 '24 04:04 ishan16696

@renormalize thanks for your initial review. I have addressed them, and mainly made changes to the helm chart, by making all CLI flags configurable.

May 03 '24 17:05 shreyas-s-rao

/test pull-etcd-druid-e2e-kind

May 06 '24 06:05 unmarshall

@ashwani2k @renormalize I have addressed your review comments. PTAL

May 07 '24 09:05 shreyas-s-rao

/retest

May 07 '24 13:05 shreyas-s-rao

@shreyas-s-rao

Deployed etcd-druid using make deploy on my local kind setup.

The multi-node etcd cluster has been created and is now ready.

I then attempted to scale down the StatefulSet replicas to 0, and the operation was successful. However, I was expecting sentinel to reject the StatefulSet update that reduced the replicas to zero. Because by default we are enabling the sentinel when we use make deploy.

Is my understanding correct?

May 10 '24 08:05 seshachalam-yv

@shreyas-s-rao

Deployed etcd-druid using make deploy on my local kind setup.

The multi-node etcd cluster has been created and is now ready.

I then attempted to scale down the StatefulSet replicas to 0, and the operation was successful. However, I was expecting sentinel to reject the StatefulSet update that reduced the replicas to zero. Because by default we are enabling the sentinel when we use make deploy.

Is my understanding correct?

Following the previous scenario, I attempted to reconcile the etcd to scale the StatefulSet back up to 3 replicas. Interestingly, it did not scale up, and I observed error logs from etcd-druid:

{"level":"error","ts":"2024-05-10T08:49:03.036Z","logger":"etcd-controller","msg":"failed to sync etcd resource","etcd":{"name":"etcd-test","namespace":"default"},"operation":"reconcileSpec","runID":"283e0ad1-8527-425e-89ba-94adaf6da9fd"," │
│ kind":"StatefulSet","error":"[Operation: Sync, Code: ERR_SYNC_STATEFULSET] StatefulSet.apps \"etcd-test\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy',  │
│ 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden","stacktrace":"github.com/gardener/etcd-druid/internal/controller/etcd.(*Reconciler).syncEtcdResources\n\t/go/src/github.com/gardener/etcd-druid/internal/controller │
│ /etcd/reconcile_spec.go:60\ngithub.com/gardener/etcd-druid/internal/controller/etcd.(*Reconciler).triggerReconcileSpecFlow\n\t/go/src/github.com/gardener/etcd-druid/internal/controller/etcd/reconcile_spec.go:27\ngithub.com/gardener/etcd-dr │
│ uid/internal/controller/etcd.(*Reconciler).reconcileSpec\n\t/go/src/github.com/gardener/etcd-druid/internal/controller/etcd/reconciler.go:157\ngithub.com/gardener/etcd-druid/internal/controller/etcd.(*Reconciler).Reconcile\n\t/go/src/githu │
│ b.com/gardener/etcd-druid/internal/controller/etcd/reconciler.go:103\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.g │
│ o:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/contr │
│ oller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs │
│ .k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

May 10 '24 08:05 seshachalam-yv

@unmarshall: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-etcd-druid-e2e-kind-nondistroless-etcd	1eb40ade43376ce3632d015315ad7d66c2612d1f	link	true	`/test pull-etcd-druid-e2e-kind-nondistroless-etcd`

Full PR test history. Your PR dashboard. Command help for this repository. Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

May 14 '24 09:05 gardener-prow[bot]

@seshachalam-yv I have now fixed the behavior of sentinel webhook for handling updates to the statefulset /scale subresource, which you had reported in comment https://github.com/gardener/etcd-druid/pull/777#issuecomment-2104203304, via commit https://github.com/gardener/etcd-druid/pull/777/commits/f75c718cfa6b966328bd6b37843ab1d69d6e9a56.

Essentially, as you can see here, webhook will not be invoked for */scale subresource if an objectSelector is specified. This is because the objectSelector specifies matchLabels, but the scale subresource does not have these labels (since the parent resource has them). Hence, I have enhanced sentinel webhook to handle this special case for statefulsets/scale subresource.

Once you pull the latest changes from the PR and run druid and deploy an etcd resource, you will now see that kubectl scale command on the sts will be denied by sentinel webhook, as expected. PTAL.

May 15 '24 14:05 shreyas-s-rao

@ishan16696 @seshachalam-yv @anveshreddy18 I have addressed your review comments in https://github.com/gardener/etcd-druid/pull/777/commits/53d5fef76966e9ac25beb675dd667319b8f1fbdc, https://github.com/gardener/etcd-druid/pull/777/commits/b9978b4e77ab5ee782f2badbea21c8364f19c947, https://github.com/gardener/etcd-druid/pull/777/commits/d41abe1f53e2763592226b9dceedd279f75f2881 respectively.

May 17 '24 03:05 shreyas-s-rao

/test pull-etcd-druid-e2e-kind

May 24 '24 16:05 unmarshall

/retest

May 29 '24 11:05 shreyas-s-rao

/test pull-etcd-druid-e2e-kind-nondistroless-etcd

Jun 07 '24 09:06 unmarshall

/test pull-etcd-druid-e2e-kind

Jun 09 '24 12:06 unmarshall

/test pull-etcd-druid-e2e-kind

Jun 10 '24 04:06 unmarshall

Included changes and release note from https://github.com/gardener/etcd-druid/pull/781

Jun 12 '24 06:06 shreyas-s-rao

/test pull-etcd-druid-e2e-kind

Jun 14 '24 11:06 ishan16696

List of manual tests executed:

TEST NAME	Druid Auto-Reconcile	Single/Multi Node	Backups (provider)	Etcd Client TLS	Etcd Peer TLS	EtcdBR TLS	TEST RESULT
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Single	NA	FALSE	FALSE	FALSE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Single	NA	TRUE	TRUE	TRUE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Single	AWS	TRUE	TRUE	TRUE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Multi	NA	FALSE	FALSE	FALSE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Multi	NA	TRUE	TRUE	TRUE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Multi	AWS	TRUE	TRUE	TRUE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Multi	GCP	TRUE	TRUE	TRUE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Multi	Azure	TRUE	TRUE	TRUE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Multi	Openstack	TRUE	TRUE	TRUE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	FALSE	Multi	Local	TRUE	TRUE	TRUE	TRUE
Perform etcd spec changes, check if reconciliation triggered	FALSE	Multi	GCP	TRUE	TRUE	TRUE	TRUE
Scale-up etcd from single-node non-TLS to multi-node non-TLS, hibernate, unhibernate	FALSE	Single	GCP	FALSE	FALSE	FALSE	TRUE
Scale-up etcd from single-node non-TLS to multi-node TLS, hibernate, unhibernate	FALSE	Single	GCP	FALSE	FALSE	FALSE	TRUE
Scale-up etcd from single-node TLS to multi-node TLS, hibernate, unhibernate	FALSE	Single	NA	TRUE	TRUE	TRUE	TRUE
Upgrade druid from master to #777, check status updates, add reconcile annotation, check reconciliation	FALSE	Multi	GCP	TRUE	TRUE	TRUE	TRUE
Deploy etcdcopybackupstask, check success	FALSE	Multi	Local	TRUE	TRUE	TRUE	TRUE
Configure compaction with low threshold, populate etcd, check if compaction jobs are triggered and run	FALSE	Single	AWS	TRUE	TRUE	TRUE	TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd	TRUE	Multi	GCP	TRUE	TRUE	TRUE	TRUE
Perform etcd spec changes, check if reconciliation triggered	TRUE	Multi	GCP	TRUE	TRUE	TRUE	TRUE

Jun 23 '24 18:06 shreyas-s-rao

New changes (minor fixes) as part of https://github.com/gardener/etcd-druid/pull/777/commits/4843c75c1f3f4cc8a45cf98639dd41c632b95ead:

Fix scale subresource to use selectorpath as Etcd.spec.selector rather than the deprecated Etcd.status.labelSelector
Add missing delete role to druid
Handle case when auto-reconcile is disabled, and Etcd spec is updated from 1 to 3 replicas, but has not yet been reconciled. During such a case, the expected members in the cluster are only 1 (actual current replicas), and not 3 (spec.replicas). This can be checked by checking whether the etcd observedGeneration matches the generation. This check is now added to Etcd.status.ready check as well as computing Etcd.status.members and consequently the AllMembersReady condition.
Minor fix in skaffold-based helm deployment, to set certain compaction-related flags only upon profile activation, and not always.

Jun 23 '24 18:06 shreyas-s-rao

/retest

Jun 24 '24 07:06 shreyas-s-rao

/retest

Jun 24 '24 08:06 shreyas-s-rao

I ran the e2e tests locally and these pass for me.

Jun 24 '24 08:06 unmarshall

etcd-druid etcd-druid copied to clipboard

Druid Refactor to Address Multiple Controller Conflicts

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

etcd-druid
etcd-druid copied to clipboard