CRD validation: block CRD deletion if CRs are still present
In RukPak, when a bundle deployment has installed CRDs and the cluster contains CRs for those CRDs, nothing currently prevents or warns a user that deleting the bundle deployment will delete those CRDs, which cascades to deleting the CRs, which in turn deletes user workloads.
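One way to implement the behavior in the title is a validating admission webhook that intercepts DELETE requests for CustomResourceDefinitions and denies them while instances still exist. The sketch below uses controller-runtime; the handler type, its name, and its wiring (webhook registration, RBAC, the ValidatingWebhookConfiguration itself) are assumptions for illustration, not existing RukPak code.

```go
package webhook

import (
	"context"
	"fmt"
	"net/http"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// crdDeletionBlocker (hypothetical) denies DELETE requests for a CRD while
// custom resources of that CRD still exist. The client's scheme must include
// apiextensions/v1.
type crdDeletionBlocker struct {
	client client.Client
}

func (b *crdDeletionBlocker) Handle(ctx context.Context, req admission.Request) admission.Response {
	// For DELETE requests, the admission request carries the name of the CRD
	// being removed; fetch it so we know which group/versions/kinds to check.
	crd := &apiextensionsv1.CustomResourceDefinition{}
	if err := b.client.Get(ctx, client.ObjectKey{Name: req.Name}, crd); err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}

	// Check every served version of the CRD for remaining custom resources.
	for _, v := range crd.Spec.Versions {
		if !v.Served {
			continue
		}
		list := &unstructured.UnstructuredList{}
		list.SetGroupVersionKind(schema.GroupVersionKind{
			Group:   crd.Spec.Group,
			Version: v.Name,
			Kind:    crd.Spec.Names.ListKind,
		})
		if err := b.client.List(ctx, list, client.Limit(1)); err != nil {
			return admission.Errored(http.StatusInternalServerError, err)
		}
		if len(list.Items) > 0 {
			return admission.Denied(fmt.Sprintf(
				"cannot delete CRD %q: %s custom resources still exist",
				crd.Name, crd.Spec.Names.Kind))
		}
	}
	return admission.Allowed("no remaining custom resources")
}
```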
This is even more problematic in future use cases where bundle deployments might exist as a result of dependency resolution. Imagine this scenario:
1. Operator A depends on Operator B.
2. ClusterUser1 says "install Operator A".
3. The system resolves 1 and 2 and installs Operators A and B.
4. ClusterUser1 happily interacts with Operator A's APIs.
5. ClusterUser2 (who may or may not be aware of how A and B were installed) starts using B's APIs.
6. ClusterUser1 decides "Operator A is no longer required" and tells the cluster as much.
7. The cluster re-resolves and says "I don't need Operator A, and since Operator B is no longer required, I can get rid of that too."
8. The cluster deletes Operator B's bundle deployment, which deletes Operator B's CRDs, which deletes those CRs.
9. ClusterUser2 is like 🤯 🤯 😭
One issue with the proposed approach is that the ValidatingWebhookConfiguration would only prevent the CRD from being removed; the BundleDeployment and all other resources it defines would still be deleted, leaving no controller watching the CRD and its associated CRs.
We should consider how to identify the scenario above before installing or uninstalling bundles with RukPak.
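On the uninstall side, one possibility is a pre-delete check that enumerates the CRDs a BundleDeployment installed and blocks (or at least warns) when any of them still have custom resources. A minimal sketch, assuming the installed CRDs carry a label tying them back to the BundleDeployment; the label key and function name here are hypothetical, not part of RukPak:

```go
package preflight

import (
	"context"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ownerLabel is a hypothetical label assumed to tie installed CRDs back to the
// BundleDeployment that created them.
const ownerLabel = "example.io/bundle-deployment-name"

// CRDsWithInstances returns the CRDs installed by the named BundleDeployment
// that still have at least one custom resource, so callers can block or warn
// before deleting the BundleDeployment.
func CRDsWithInstances(ctx context.Context, c client.Client, bdName string) ([]string, error) {
	crds := &apiextensionsv1.CustomResourceDefinitionList{}
	if err := c.List(ctx, crds, client.MatchingLabels{ownerLabel: bdName}); err != nil {
		return nil, err
	}

	var inUse []string
	for _, crd := range crds.Items {
		for _, v := range crd.Spec.Versions {
			if !v.Served {
				continue
			}
			// List at most one object per served version; one hit is enough
			// to flag the CRD as still in use.
			crs := &unstructured.UnstructuredList{}
			crs.SetGroupVersionKind(schema.GroupVersionKind{
				Group:   crd.Spec.Group,
				Version: v.Name,
				Kind:    crd.Spec.Names.ListKind,
			})
			if err := c.List(ctx, crs, client.Limit(1)); err != nil {
				return nil, err
			}
			if len(crs.Items) > 0 {
				inUse = append(inUse, fmt.Sprintf("%s (%s)", crd.Name, crd.Spec.Names.Kind))
				break
			}
		}
	}
	return inUse, nil
}
```

A finalizer on the BundleDeployment (or the resolver itself) could call a check like this and surface the affected CRDs in a status condition instead of silently cascading the delete.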
One other tidbit: the bundle deployment deletion triggers a helm uninstall of the bundle contents under the hood. If that uninstall fails, I think the bundle deployment controller will attempt to put things back to the way they were before the uninstall started. Maybe that helps?
@awgreene Do you remember why this was marked urgent priority? It seems like a problem, but not something we need to tackle immediately.
Backfilling to v0.10.0 for now. This is something I could see being pretty valuable over time. We can continue to triage this in the future.
This seems like it'll need more thought and design consideration, so I'm going to move this to the backlog and demote to important-longterm.