eks-anywhere icon indicating copy to clipboard operation
eks-anywhere copied to clipboard

Manually updating state for `in_use` Tinkerbell Hardware objects cause cluster to go down

Open pokearu opened this issue 3 years ago • 3 comments

Description

If users manually modify tinkerbell Hardware objects that are part of a cluster, i.e. have a .Spec.Metadata.State='in_use' and update the state to provisioning or any other value, CAPT re-provisions the hardware. This results in EKS-A cluster nodes getting powered off and the cluster goes down.

Possible Fix

Having a validating webhook on CAPT or Tink to prevent Hardware objects that are part of cluster from being modified manually. Hardware object state should be changed only when cluster is being deleted.

pokearu avatar Jul 18 '22 19:07 pokearu

What makes this problem any different to a database with incorrectly configured RBAC?

Are you able to elaborate on how you can distinguish between a "manual edit" and non-manual?

chrisdoherty4 avatar Jul 22 '22 14:07 chrisdoherty4

An example of manual edit would be when a user applies hardware objects that are generated from eks-a cli generate command, but these hardware are already part of a cluster. It would set the State back to provisioning since that is the default value we set.

A non-manual edit, is something like running upgrade from eks-a cli. Here we would have validations to ensure we do not modify hardware that are part of cluster even if the CSV file given as input has them.

This is in a way different from incorrect RBAC because CAPT is heavily reliant on the .State value and we have no validations/restrictions to prevent its modification.

pokearu avatar Jul 26 '22 19:07 pokearu

An example of manual edit would be when a user applies hardware objects that are generated from eks-a cli generate command, but these hardware are already part of a cluster. It would set the State back to provisioning since that is the default value we set.

I think that's unsupported behavior. We can't be expected to have hardware in some semi-in-use state submitted and know what to do with it. CAPI even calls this kind of thing out as a non-goal for good reason - perhaps we should do the same.

A non-manual edit, is something like running upgrade from eks-a cli. Here we would have validations to ensure we do not modify hardware that are part of cluster even if the CSV file given as input has them.

I think I'd expect Kubernetes to reject submission of the Hardware object given its name wouldn't be unique within the given namespace.


The reason I mentioned RBAC is because I don't think this is quite as clear cut as making the object, or a subset of its fields, immutable based on some other field already set. I think the client identity matters also.

To elaborate, CAPT needs to be able to update parts of the Hardware object (and other services) where as an end-user like an operator probably shouldn't be able to do that. The existing state seems like it would matter but its not the only data point.

The annoying part is that operators can be admins and grant themselves perms to update the Hardware object so it makes it very tricky to explicitly block.

We might want to explore other Kubernetes mechanisms and see if we can come up with a set of options.

chrisdoherty4 avatar Aug 19 '22 17:08 chrisdoherty4

Based on discussions with the community, we have decided to not make Hardware CR immutable by default. We are instead going to add blocks/validations on eksctl anywhere to prevent repeated hardware inputs.

pokearu avatar Sep 26 '22 17:09 pokearu