Semantics for destructive operations design proposal
Issue #, if available: https://github.com/aws-controllers-k8s/community/issues/82
Description of changes: Draft proposal for mechanisms to protect users against accidental deletion of ACK and/or AWS resources
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RedbackThomson
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [RedbackThomson]
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
@jaypipes
I am opposed to the entire "deletion protection" feature, for reasons stated inline in my comments.
I actually do agree with this. I wanted to write this proposal to get critical feedback like this. As I was writing the proposal I felt that we would be over-engineering the functionality. However, I still felt it was important to explore the technical challenges involved so that we could all see the pros and cons of the feature.
I will be removing the "Deletion protection" from this proposal - I might add an appendix where I mention this discussion to keep it for posterity. I will pivot this proposal into discussing only the "AWS Removal Policy" functionality.
On that front:
For the reverse of adoption action, I would prefer to have a similarly explicit action, essentially an "unadopt" or "unmanage" action
The reason we chose to have a separate CRD for adoption was mainly because we could not fit enough information into the existing CRD (or metadata) of the ACK resources, or we would have had to significantly redesign the optional/required fields. Further, the adopted resource CR acts as a marker for, and stores the state for, the adoption reconciler to create a new ACK resource which the controller then reconciles as if it were previously created.
I don't believe the removal policy needs anything as sophisticated as this. This proposal only addresses whether the controller will or will not call Delete* in the SDK before it deletes the finalizer. We do not need to modify any elements of the spec or metadata (apart from this single annotation) on the ACK resource. The ACK resource is also immediately deleted after the delete operation, so having a remaining "Unadoption" custom resource would essentially only act as a log.
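To make that concrete, here is a minimal sketch (not the actual ACK controller code; the helper names are hypothetical) of how a reconciler's delete path could honour the services.k8s.aws/deletion-policy=retain annotation from this proposal by skipping the Delete* SDK call and only removing the finalizer:

```go
package main

import "fmt"

// Annotation key taken from the proposal; whether the key is generic or
// service-scoped (e.g. s3.services.k8s.aws/) is still under discussion.
const deletionPolicyKey = "services.k8s.aws/deletion-policy"

// shouldCallAWSDelete decides whether the controller should invoke the
// service's Delete* API for a resource being deleted from the cluster.
// With deletion-policy=retain the AWS resource is left in place and only
// the Kubernetes object (and its finalizer) goes away.
func shouldCallAWSDelete(annotations map[string]string) bool {
	return annotations[deletionPolicyKey] != "retain"
}

// handleDeletion is a hypothetical delete path: call the AWS API only when
// the policy allows it, then remove the finalizer either way so that the
// Kubernetes object can be garbage-collected.
func handleDeletion(annotations map[string]string) {
	if shouldCallAWSDelete(annotations) {
		fmt.Println("calling Delete* in the AWS SDK")
	} else {
		fmt.Println("retain policy set; skipping Delete*")
	}
	fmt.Println("removing ACK finalizer")
}

func main() {
	handleDeletion(map[string]string{deletionPolicyKey: "retain"})
}
```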
Your kubectl unadopt alias could easily be made, and would be as simple as:
kubectl annotate bucket my-bucket services.k8s.aws/deletion-policy=retain
kubectl delete bucket my-bucket
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale
/remove-lifecycle stale
Having read the proposal and the comments, I can see that there are some strong opinions and fierce opposition to this feature, so perhaps the point is moot. Nevertheless, I'd like to weigh in for some additional perspective since my team is starting to evaluate ACK and this was one area of concern.
Many teams, ours included, use k8s only for stateless applications. Losing a deployment/service/configmap due to an erroneous merge is easy to roll back and no permanent damage is done. Losing an RDS instance or S3 bucket, OTOH, could be a disaster from which there is no recovery. For that reason, some kind of protection would be welcomed.
Additionally, when using managed Kubernetes services like EKS in which the customer has no control plane access, there's a certain nervousness that something could go wrong in etcd completely outside of customer control. What if a weird bug in EKS caused objects in etcd to disappear and as a result the underlying AWS resources are deleted? Some of the protections mentioned in the proposal might protect from this eventuality (e.g. a policy on the controller). Perhaps this is highly unlikely/nigh impossible, but is it fair to write off this worry entirely? We've already experienced unexplainable behavior with etcd on EKS (not this specific thing) so I think the concern is well founded.
"Unadopting" resources is only useful when removal is planned. It's the unplanned errors that I'd worry about.
I do hope this proposal will remain open for others to offer still more perspectives.
What if a weird bug in EKS caused objects in etcd to disappear and as a result the underlying AWS resources are deleted
Although unlikely, I understand your point. The controller is designed to delete resources when it sees the deletionTimestamp property is set to a non-nil value. Disappearing from etcd is less of a concern, but some other tool setting this value on behalf of an unwilling user is a fair concern.
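For illustration only (this is not ACK code), the check the controller effectively reacts to can be written with the standard apimachinery accessor:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isBeingDeleted reports whether the API server has marked the object for
// deletion; this is the signal the controller acts on, rather than the
// object silently vanishing from etcd.
func isBeingDeleted(obj metav1.Object) bool {
	return obj.GetDeletionTimestamp() != nil
}

func main() {
	now := metav1.Now()
	meta := &metav1.ObjectMeta{DeletionTimestamp: &now}
	fmt.Println(isBeingDeleted(meta)) // true once a delete has been requested
}
```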
I do still agree with @jaypipes with respect to not having deletion protection. We would essentially be recreating RBAC in a very naive way, introducing yet another mechanism for people to configure permissions on top of K8s RBAC and AWS IAM. Those two systems are far better suited for that style of fine-grained access control. If you really, really wanted to make sure your production systems can't be deleted by an ACK controller, you could even modify the IAM role associated with the controller and deny all Delete:* permissions.
However, I think the proposal for adding a retain setting on resources still stands. There may be situations where you'd like to migrate stateful applications between ACK-enabled clusters. Setting retain on each of your stateful application infrastructure elements, deleting them from the cluster and then adopting them into the new cluster would greatly reduce the chance of accidental deletion. Furthermore, proactively setting retain on stateful resources would protect those resources from being accidentally deleted by someone who doesn't realise they're in the wrong cluster context.
This proposal will remain open until either someone decides to act on it - doesn't have to be the ACK core team, this feature is open for anyone to take a crack at - or until it is dismissed for one reason or another.
We would essentially be recreating RBAC in a very naive way, introducing yet another mechanism for people to configure permissions on top of K8s RBAC and AWS IAM.
AFAICT this proposal or any resource protection feature doesn't resemble RBAC in the least. There isn't anything about access control or permissions. Rather, this is about how to decide whether a resource should be deleted.
However, the proposal for adding a retain setting on resources I think still stands
Interesting. So you're suggesting something like (though not exactly) the following:
s3.services.k8s.aws/deletion-policy: retain
for any individual resource, but not for an entire controller or namespace? Am I understanding your point correctly?
If so, that's great, but that smells like a deletion policy to me.
AFAICT this proposal or any resource protection feature doesn't resemble RBAC in the least. There isn't anything about access control or permissions. Rather, this is about how to decide whether a resource should be deleted.
Oh, actually, I removed that from the proposal altogether. That's what the majority of the disagreement was about - designing an annotation that prevents deletion of a given resource. Never mind, then.
for any individual resource, but not for an entire controller or namespace
Correct. This is the current intention for this version of the proposal.
smells like a deletion policy to me
Well, it was borrowed from the CloudFormation DeletionPolicy resource property, so that's not a coincidence.
Got it! I think I was missing all the RBAC context. Now it makes more sense.
for any individual resource, but not for an entire controller or namespace
Correct. This is the current intention for this version of the proposal.
Nice, seems really clean then.
Well it was borrowed from the CloudFormation DeletionPolicy resource property, so that's not a coincidence.
FWIW, Terraform uses the term "prevent destroy" which also seems apropos to this proposal.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale
/lifecycle frozen
@A-Hilaly: The lifecycle/frozen label cannot be applied to Pull Requests.
In response to this:
/lifecycle frozen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle stale
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jljaco, RedbackThomson
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [RedbackThomson,jljaco]
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.