etcd-druid icon indicating copy to clipboard operation
etcd-druid copied to clipboard

Druid Refactor to Address Multiple Controller Conflicts

Open unmarshall opened this issue 10 months ago • 13 comments

How to categorize this PR?

/area control-plane dev-productivity quality /kind api-change enhancement cleanup test

What this PR does / why we need it:

Which issue(s) this PR fixes:

  • Fixes #728
  • Fixes #759
  • Changes made as part of https://github.com/gardener/etcd-druid/pull/772 have also been included in this PR.

Release note:

Custodian controller has now been removed in favour of etcd status reconciliation handled by etcd controller. CLI flags `--custodian-workers` and `--custodian-sync-period` have now been removed, and are no longer recognised by etcd-druid.
CLI flag `--workers` has now been renamed to `--etcd-workers`. Additionally, etcd controller also accepts new CLI flags `enable-etcd-spec-auto-reconcile` to control how and when the etcd spec is reconciled, and `etcd-status-sync-period` to specify the duration after which an event will be re-queued to ensure etcd status reconciliation. CLI flag `ignore-operation-annotation` has been deprecated, and will be removed in an upcoming release.
CLI flag `--metrics-addr` is now deprecated. Please use `--metrics-bind-address` and `--metrics-port` instead.
A new validating webhook named `sentinel` has been introduced to safeguard resources created by etcd-druid. A new annotation `druid.gardener.cloud/disable-resource-protection` has been introduced, which if set, tells sentinel webhook to allow manual changes by an operator on any resource managed by etcd-druid. This webhook can be enabled by passing CLI flag `--enable-sentinel-webhook` to the etcd-druid binary. Additionally, CLI flags `--webhook-server-bind-address`, `--webhook-server-port` and `--webhook-server-tls-server-cert-dir` need to be passed when enabling the webhook, which enforces TLS communication using the given certs. 
Annotation `druid.gardener.cloud/ignore-reconciliation` has been marked as deprecated. Please use `druid.gardener.cloud/suspend-etcd-spec-reconcile` instead, which provides the same behavior.
All packages under `/pkg` and `/controllers` directories have now been moved to new parent `/internal` directory.
The component model used for deploying resources has now been replaced with a simplified `ResourceOperator` model, found under `/internal/operator`.
Etcd resource status now includes field `LastOperation` to indicate the last operation performed on the etcd resource. This includes a unique `RunID` to help sift through logs containing the specific `RunID`, improving debuggability. Every reconciler run generates a unique `RunID`.
Etcd resource status now includes field `LastErrors` to indicate any errors encountered in the last reconciliation of the etcd resource. Custom error codes have been introduced to help capture contextual information from the reconciliation run.
Labels on druid-managed resources are now streamlined, and no longer include `name` and `instance`. Instead, these are now standard labels `app.kubernetes.io/managed-by` and `app.kubernetes.io/part-of`, as [recommended by Kubernetes](https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/#labels). Additionally, `app.kubernetes.io/component` label is also used to set the type of the component for an etcd cluster.
Scale-up logic for single-node etcd clusters with peerTLS disabled to multi-node etcd clusters with peerTLS enabled, has been improved by making it deterministic and eliminates an unnecessary restart of the first etcd member, thus making this process faster and error-free.
A new condition `DataVolumesReady` has been introduced in `etcd.Status` to capture and report PVC warnings.
Etcd pods now mount files with `DefaultMode` set to `0640`.
etcd-backup-restore container was started with SYS_PTRACE linux capability. This prevented creating etcd cluster with [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/). This linux capability has now been removed as it is no longer required.
Volume mounts for the etcd StatefulSet have now been fixed, to allow individually specifying TLS secrets for the etcd and backup-restore servers. CA and TLS certificates used for etcd client-server communication, relevant to the container that they are mounted on, can be found at `/var/etcd/ssl/`. CA and TLS certificates used for etcd peer communication, relevant to the container that they are mounted on, can be found at `/var/etcd/ssl/peer`. CA and TLS certificates used for etcd-backup-restore client-server communication, relevant to the container that they are mounted on, can be found at `/var/etcdbr/ssl`.
The backup secret mounted onto the backup-restore container in the etcd pod, used by etcd-backup-restore process, is now mounted onto `/var/backup-store` instead of `/var/etcd-backup`.
We are moving towards using golang native tests. This also allowed us to relook at the unit and integration tests that we have. In this PR we have only partially introduced comprehensive golang native tests for specific packages (`internal/operator`, `internal/webhook`, `internal/controller/etcd/` and `internal/utils/`). We have also added comprehensive integration tests for etcd controller and the new IT tests are present at `test/it/controller/etcd`. In future PRs we will replace the ginkgo based tests and replace it with native golang tests for rest of the packages as well.

unmarshall avatar Apr 02 '24 08:04 unmarshall

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
- RSA Private Key e2bfea3ca14e1583a0a5785c330c45df68f9e3a5 charts/druid/resources/server.key View secret
- RSA Private Key e2bfea3ca14e1583a0a5785c330c45df68f9e3a5 charts/druid/resources/ca.key View secret
- RSA Private Key e2bfea3ca14e1583a0a5785c330c45df68f9e3a5 charts/druid/templates/secret-server-tls-crt.yaml View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

gitguardian[bot] avatar Apr 02 '24 08:04 gitguardian[bot]

shreyas-s-rao avatar Apr 05 '24 14:04 shreyas-s-rao

@unmarshall You need rebase this pull request with latest master branch. Please check.

gardener-robot avatar Apr 16 '24 09:04 gardener-robot

/retest

shreyas-s-rao avatar Apr 22 '24 09:04 shreyas-s-rao

/assign

seshachalam-yv avatar Apr 25 '24 05:04 seshachalam-yv

/assign

ishan16696 avatar Apr 30 '24 04:04 ishan16696

@renormalize thanks for your initial review. I have addressed them, and mainly made changes to the helm chart, by making all CLI flags configurable.

shreyas-s-rao avatar May 03 '24 17:05 shreyas-s-rao

/test pull-etcd-druid-e2e-kind

unmarshall avatar May 06 '24 06:05 unmarshall

@ashwani2k @renormalize I have addressed your review comments. PTAL

shreyas-s-rao avatar May 07 '24 09:05 shreyas-s-rao

/retest

shreyas-s-rao avatar May 07 '24 13:05 shreyas-s-rao

@shreyas-s-rao

Deployed etcd-druid using make deploy on my local kind setup.

The multi-node etcd cluster has been created and is now ready.

I then attempted to scale down the StatefulSet replicas to 0, and the operation was successful. However, I was expecting sentinel to reject the StatefulSet update that reduced the replicas to zero. Because by default we are enabling the sentinel when we use make deploy.

Is my understanding correct?

seshachalam-yv avatar May 10 '24 08:05 seshachalam-yv

@shreyas-s-rao

Deployed etcd-druid using make deploy on my local kind setup.

The multi-node etcd cluster has been created and is now ready.

I then attempted to scale down the StatefulSet replicas to 0, and the operation was successful. However, I was expecting sentinel to reject the StatefulSet update that reduced the replicas to zero. Because by default we are enabling the sentinel when we use make deploy.

Is my understanding correct?

Following the previous scenario, I attempted to reconcile the etcd to scale the StatefulSet back up to 3 replicas. Interestingly, it did not scale up, and I observed error logs from etcd-druid:

{"level":"error","ts":"2024-05-10T08:49:03.036Z","logger":"etcd-controller","msg":"failed to sync etcd resource","etcd":{"name":"etcd-test","namespace":"default"},"operation":"reconcileSpec","runID":"283e0ad1-8527-425e-89ba-94adaf6da9fd"," │
│ kind":"StatefulSet","error":"[Operation: Sync, Code: ERR_SYNC_STATEFULSET] StatefulSet.apps \"etcd-test\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy',  │
│ 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden","stacktrace":"github.com/gardener/etcd-druid/internal/controller/etcd.(*Reconciler).syncEtcdResources\n\t/go/src/github.com/gardener/etcd-druid/internal/controller │
│ /etcd/reconcile_spec.go:60\ngithub.com/gardener/etcd-druid/internal/controller/etcd.(*Reconciler).triggerReconcileSpecFlow\n\t/go/src/github.com/gardener/etcd-druid/internal/controller/etcd/reconcile_spec.go:27\ngithub.com/gardener/etcd-dr │
│ uid/internal/controller/etcd.(*Reconciler).reconcileSpec\n\t/go/src/github.com/gardener/etcd-druid/internal/controller/etcd/reconciler.go:157\ngithub.com/gardener/etcd-druid/internal/controller/etcd.(*Reconciler).Reconcile\n\t/go/src/githu │
│ b.com/gardener/etcd-druid/internal/controller/etcd/reconciler.go:103\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.g │
│ o:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/contr │
│ oller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs │
│ .k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}       

seshachalam-yv avatar May 10 '24 08:05 seshachalam-yv

@unmarshall: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-druid-e2e-kind-nondistroless-etcd 1eb40ade43376ce3632d015315ad7d66c2612d1f link true /test pull-etcd-druid-e2e-kind-nondistroless-etcd

Full PR test history. Your PR dashboard. Command help for this repository. Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

gardener-prow[bot] avatar May 14 '24 09:05 gardener-prow[bot]

@seshachalam-yv I have now fixed the behavior of sentinel webhook for handling updates to the statefulset /scale subresource, which you had reported in comment https://github.com/gardener/etcd-druid/pull/777#issuecomment-2104203304, via commit https://github.com/gardener/etcd-druid/pull/777/commits/f75c718cfa6b966328bd6b37843ab1d69d6e9a56.

Essentially, as you can see here, webhook will not be invoked for */scale subresource if an objectSelector is specified. This is because the objectSelector specifies matchLabels, but the scale subresource does not have these labels (since the parent resource has them). Hence, I have enhanced sentinel webhook to handle this special case for statefulsets/scale subresource.

Once you pull the latest changes from the PR and run druid and deploy an etcd resource, you will now see that kubectl scale command on the sts will be denied by sentinel webhook, as expected. PTAL.

shreyas-s-rao avatar May 15 '24 14:05 shreyas-s-rao

@ishan16696 @seshachalam-yv @anveshreddy18 I have addressed your review comments in https://github.com/gardener/etcd-druid/pull/777/commits/53d5fef76966e9ac25beb675dd667319b8f1fbdc, https://github.com/gardener/etcd-druid/pull/777/commits/b9978b4e77ab5ee782f2badbea21c8364f19c947, https://github.com/gardener/etcd-druid/pull/777/commits/d41abe1f53e2763592226b9dceedd279f75f2881 respectively.

shreyas-s-rao avatar May 17 '24 03:05 shreyas-s-rao

/test pull-etcd-druid-e2e-kind

unmarshall avatar May 24 '24 16:05 unmarshall

/retest

shreyas-s-rao avatar May 29 '24 11:05 shreyas-s-rao

/test pull-etcd-druid-e2e-kind-nondistroless-etcd

unmarshall avatar Jun 07 '24 09:06 unmarshall

/test pull-etcd-druid-e2e-kind

unmarshall avatar Jun 09 '24 12:06 unmarshall

/test pull-etcd-druid-e2e-kind

unmarshall avatar Jun 10 '24 04:06 unmarshall

Included changes and release note from https://github.com/gardener/etcd-druid/pull/781

shreyas-s-rao avatar Jun 12 '24 06:06 shreyas-s-rao

/test pull-etcd-druid-e2e-kind

ishan16696 avatar Jun 14 '24 11:06 ishan16696

List of manual tests executed:

TEST NAME Druid Auto-Reconcile Single/Multi Node Backups (provider) Etcd Client TLS Etcd Peer TLS EtcdBR TLS TEST RESULT
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Single NA FALSE FALSE FALSE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Single NA TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Single AWS TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi NA FALSE FALSE FALSE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi NA TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi AWS TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi GCP TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi Azure TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi Openstack TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi Local TRUE TRUE TRUE TRUE
Perform etcd spec changes, check if reconciliation triggered FALSE Multi GCP TRUE TRUE TRUE TRUE
Scale-up etcd from single-node non-TLS to multi-node non-TLS, hibernate, unhibernate FALSE Single GCP FALSE FALSE FALSE TRUE
Scale-up etcd from single-node non-TLS to multi-node TLS, hibernate, unhibernate FALSE Single GCP FALSE FALSE FALSE TRUE
Scale-up etcd from single-node TLS to multi-node TLS, hibernate, unhibernate FALSE Single NA TRUE TRUE TRUE TRUE
Upgrade druid from master to #777, check status updates, add reconcile annotation, check reconciliation FALSE Multi GCP TRUE TRUE TRUE TRUE
Deploy etcdcopybackupstask, check success FALSE Multi Local TRUE TRUE TRUE TRUE
Configure compaction with low threshold, populate etcd, check if compaction jobs are triggered and run FALSE Single AWS TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd TRUE Multi GCP TRUE TRUE TRUE TRUE
Perform etcd spec changes, check if reconciliation triggered TRUE Multi GCP TRUE TRUE TRUE TRUE

shreyas-s-rao avatar Jun 23 '24 18:06 shreyas-s-rao

New changes (minor fixes) as part of https://github.com/gardener/etcd-druid/pull/777/commits/4843c75c1f3f4cc8a45cf98639dd41c632b95ead:

  1. Fix scale subresource to use selectorpath as Etcd.spec.selector rather than the deprecated Etcd.status.labelSelector
  2. Add missing delete role to druid
  3. Handle case when auto-reconcile is disabled, and Etcd spec is updated from 1 to 3 replicas, but has not yet been reconciled. During such a case, the expected members in the cluster are only 1 (actual current replicas), and not 3 (spec.replicas). This can be checked by checking whether the etcd observedGeneration matches the generation. This check is now added to Etcd.status.ready check as well as computing Etcd.status.members and consequently the AllMembersReady condition.
  4. Minor fix in skaffold-based helm deployment, to set certain compaction-related flags only upon profile activation, and not always.

shreyas-s-rao avatar Jun 23 '24 18:06 shreyas-s-rao

/retest

shreyas-s-rao avatar Jun 24 '24 07:06 shreyas-s-rao

/retest

shreyas-s-rao avatar Jun 24 '24 08:06 shreyas-s-rao

I ran the e2e tests locally and these pass for me.

unmarshall avatar Jun 24 '24 08:06 unmarshall