faros
faros copied to clipboard
Resource Deletion Concerns
We're concerned with resources being deleted in production environments. We consider the current Faros approach a risk and therefore have introduced proposals that would mitigate this risk.
This issue summarises those concerns and proposals. I'm looking forward to your take on those and we're happy to work on those proposals.
- Deleting
GitTrack
resource deletes all resources managed by Faros: https://github.com/pusher/faros/issues/98 -
--cascade=false
not working with Faros: https://github.com/pusher/faros/issues/99 - Potential misconfiguration leading to all resources managed by Faros being deleted: https://github.com/pusher/faros/issues/100
Our previous issue https://github.com/pusher/faros/issues/92 already hinted at some of those concerns, but addresses the wrong issue.
As for the proposal to implement a Webhook Admission controller to prevent modification of the GitTrack
resource mentioned in #92, we believe that this is only addressing a part of those concerns. Changes in the repository leading to resources being deleted cannot be prevented this way. Also, it would make intentional changes to the GitTrack
harder.
Any other thoughts on this?
Hi @sebastianroesch, thanks for raising this issue.
This is something we have been thinking about internally for a few weeks since we had an incident in which the GitTrack managing our kube-system
namespace on one of our test clusters was accidentally overwritten and, well... everything got deleted 🤦♂️
Forgive me if I have misunderstood, but #98 and #100 aim to solve the same problem as far as I can tell?
When designing Faros, we had intended for Faros to be the owner of each of the resources and therefore, if the GitTrack were to be deleted, the child resources should also be deleted. There are however, occasions where you may not want this to happen as you have raised.
I like the idea of #100 personally and think this would probably be the best way to go about solving this issue.
We could add the DeleteStrategy
(I think I would maybe rename this to DeletionStrategy
as it fits with fields like DeletionTimestamp
, WDYT?) field to the GitTracks as you've described.
There are three cases to handle then:
-
cleanup (default)
: No change to the existing behaviour.OwnerReferences
and the GarbageCollector should handle all deletion. -
resource-strategy
: TheDeleteStrategy
and aFinalizer
would need to be propagated onto theGitTrackObjects
andClusterGitTrackObjects
when they are created/updated. Then, when aGitTrack
is deleted, the GarbageCollector will mark theGTO
andCGTO
s for deletion, theGitTrackObjectController
would check for this on each reconcile and handle the deletion logic reading an annotation from the child resources data. If the option is to not delete, then the controller removes theOwnerReference
from the child object and then removes it'sFinalizer
allowing the GTO/CGTO to be GC'ed. -
never
: Similar to theresource-strategy
, except we just always cleanup theOwnerReference
andFinalizer
What do you think of this plan? Does this make sense?
Unfortunately, there is a kind of prerequisite to this work being done. This would be a change in the API of the faros.pusher.com
group, so we would need to create the v1beta1
group (something that has been a long time coming). Before we can do that we need to work out the migration path, I'm currently kind of waiting on https://github.com/kubernetes-sigs/controller-runtime/pull/351 to make progress before tackling the problem of creating the new API group.
I'm semi tempted to make a release branch that we can continue to release on while we work on the migration, allowing us to start work on these breaking changes but leave the migration logic until later, would that work do you think?
I like DeletionStrategy
as the name too.
I think the idea of separating #98 and #100 is just to iterate and get the simplest change possible into master
as quickly as possible. For example, if it's simple to implement a DeletionStrategy
with just cleanup
and never
we could start there. We could maybe reword them a bit once we agree on an approach.
Maybe I'm missing something regarding https://github.com/kubernetes-sigs/controller-runtime/pull/351, but could we just make DeletionStrategy
an additive feature and therefore not require us to change the API version? I was under the (possibly wrong) impression that k8s API versions are kind of like the Major version in SemVer which only require changing when the change is breaking.
I'd like to avoid branching further. I'm a big fan of keeping one code line as much as possible. I realize that this is tricky with larger changes as well as the async nature of OSS. I still think we should try and break this down into smaller changes if possible and keep them in master
if we can.
We could maybe reword them a bit once we agree on an approach.
Naming is incredibly hard! Do we have any alternate suggestions for these names? Do they need changing? Are cleanup
and never
obvious as to what they should be doing? I'm gonna have a think on this 🤔
Maybe I'm missing something regarding kubernetes-sigs/controller-runtime#351, but could we just make DeletionStrategy an additive feature and therefore not require us to change the API version? I was under the (possibly wrong) impression that k8s API versions are kind of like the Major version in SemVer which only require changing when the change is breaking.
Yeah you're right here, I'm being a bit overly cautious! I was wondering what would happen if you start adding random fields to objects when they aren't in the OpenAPI schema, turns out, nothing! It just accepts them anyway. So yeah, we can totally just add this field to the existing API group, ignore what I said earlier 🙈
I'd like to avoid branching further. I'm a big fan of keeping one code line as much as possible. I realize that this is tricky with larger changes as well as the async nature of OSS. I still think we should try and break this down into smaller changes if possible and keep them in master if we can.
Yep, I agree completely, was just trying to think of a nice way to workaround the issue of waiting for the API group to bump but we don't need to so Ignore that.
Thanks for the sanity check btw @tshak !
Thanks @JoelSpeed for following up on this.
#98 is concerned with the cascade-deletion of all resources when the GitTrack
object is deleted while #100 is concerned with deleting (or not deleting) resources that are no longer in Git. If I understood correctly, the proposed DeletionStrategy
does not prevent unintended cascade-delete of resources linked to GitTrackObject
s and GitTrack
via ownerReference
.
We could add the DeleteStrategy (I think I would maybe rename this to DeletionStrategy as it fits with fields like DeletionTimestamp, WDYT?) field to the GitTracks as you've described.
I like the name DeletionStrategy
and the other proposed names.
The DeleteStrategy and a Finalizer would need to be propagated onto the GitTrackObjects and ClusterGitTrackObjects when they are created/updated. Then, when a GitTrack is deleted, the GarbageCollector will mark the GTO and CGTOs for deletion, the GitTrackObjectController would check for this on each reconcile and handle the deletion logic reading an annotation from the child resources data.
Maybe I'm not understanding correctly how Finalizers and GC works. The resource-strategy
was intended as a way to be explicit about deletes. In cases where it's not acceptable that an empty folder or misconfigured GitTrack
delete all resources, this strategy would only delete the GitTrackObject
s of resources that have a specific annotation in Git. How does this relate to the GitTrack
being deleted? If the GitTrack
is gone, I assume Faros will stop creating, updating and deleting resources completely? Or does this handle a different case than I have in mind?
never: Similar to the resource-strategy, except we just always cleanup the OwnerReference and Finalizer
My impression is that this makes sense, except for cases where, as proposed, an ownerReference
hasn't been created in the first place.
#98 is concerned with the cascade-deletion of all resources when the GitTrack object is deleted while #100 is concerned with deleting (or not deleting) resources that are no longer in Git.
So I think we can kill 2 birds with 1 stone here 😄
Maybe I'm not understanding correctly how Finalizers and GC works.
The reason I think we can Kill two birds with one stone is because of the Finalizers! So, the theory here is that we make our own cascade using the Finalizer
and the DeletionStrategy
, whether you --cascade=false
or not, the Finalizer
on the GTO/CGTO
s prevents them from being deleted until the deletion logic is handled by Faros.
If cascade=true
, then deleting the GitTrack
will cause the GC
to attempt to delete all GTO/CGTO
s, the Finalizer
blocks the delete until we handle removing OwnerReferences
from the child objects meant to be kept. Once the OwnerReferences
have been removed, remove the Finalizer
allowing the GTO/CGTO
to be deleted, this in turn would cause any resource owned by the GTO/CGTO
to be deleted, but since we removed the ownership relation, this propagation of deletion is halted.
Does that make any more sense?
My impression is that this makes sense, except for cases where, as proposed, an ownerReference hasn't been created in the first place.
Where is this proposed? I don't think we should be modifying the existing behaviour of adding OwnerReferences
to objects meant to be managed. Faros's design relies heavily on the OwnerReference
relations to make lots of bits internally work (all the resource watching relies on OwnerReferences
for example)
I finally found the time to process this, and I think it's a good proposal. This should address all our concerns. Thanks @JoelSpeed for the clarification.