cluster-api Generated YAML definitions for clusters are idempotent

What would you like to be added (User Story)?

As a cluster operator, I'd like to be able to use the YAML created by clusterctl generate cluster to be idempotent so that I can easily ensure my cluster infrastructure as code stored in git matches what is deployed in the real environment.

Detailed Description

As a cluster operator, I'd like to be able to use the YAML created by clusterctl generate cluster to be idempotent so that I can ensure my cluster infrastructure as code stored in git matches what is deployed in the real environment.

Today this may or may not work, depending on if anything has changed on the cluster. When it doesn't work, you get back an error message. A user is not sure why this didn't work since it might be the same cluster definition they originally applied.

This problem can be exacerbated when you're using CAPI + GitOps in a reconciliation loop. It is cumbersome/tedious to do an actual backup of the cluster after the initial generated manifest has been applied and then extract the needed raw K8s YAML which would/should be supported in an idempotent way.

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature

May 16 '23 01:05 dtzar

This issue is currently awaiting triage.

If CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

May 16 '23 01:05 k8s-ci-robot

In which way are the YAMLs not idempotent today? Can you provide a concrete example?

May 16 '23 04:05 sbueringer

Also, please provide the error you are referring to in the issue description

/triage needs-information

May 18 '23 16:05 fabriziopandini

I have a CAPZ definition against a k8s for docker-desktop management cluster, then I reset the management cluster, installed same exact version of the management cluster and secrets, re-applied the same exact definition and I get these errors:

cluster.cluster.x-k8s.io/ascapz unchanged
machinepool.cluster.x-k8s.io/pool0 created
machinepool.cluster.x-k8s.io/pool1 created
azureclusteridentity.infrastructure.cluster.x-k8s.io/asc-cluster-identity unchanged
Error from server (InternalError): error when creating "cluster.yaml": Internal error occurred: failed calling webhook "default.azuremanagedcontrolplanes.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capz-webhook-service.capz-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-azuremanagedcontrolplane?timeout=10s": dial tcp 10.104.156.200:443: connect: connection refused       
Error from server (InternalError): error when creating "cluster.yaml": Internal error occurred: failed calling webhook "validation.azuremanagedclusters.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capz-webhook-service.capz-system.svc:443/validate-infrastructure-cluster-x-k8s-io-v1beta1-azuremanagedcluster?timeout=10s": dial tcp 10.104.156.200:443: connect: connection refused
Error from server (InternalError): error when creating "cluster.yaml": Internal error occurred: failed calling webhook "default.azuremanagedmachinepools.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capz-webhook-service.capz-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-azuremanagedmachinepool?timeout=10s": dial tcp 10.104.156.200:443: connect: connection refused
Error from server (InternalError): error when creating "cluster.yaml": Internal error occurred: failed calling webhook "default.azuremanagedmachinepools.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capz-webhook-service.capz-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-azuremanagedmachinepool?timeout=10s": dial tcp 10.104.156.200:443: connect: connection refused

May 22 '23 21:05 dtzar

Looks to me like the CAPZ controller is not up/reachable

I don't see a correlation with the YAML we generate

May 23 '23 05:05 sbueringer

I agree with @sbueringer that the output above simply looks like the webhooks were unavailable at that time.

That said, in terms of idempotency, defaulting webhooks are capable of mutating yaml upon arrival, which can render the original (incomplete) yaml invalid if those default-populated fields also have a corresponding immutability check. (Or if they don't then, you're still in a bad state where you're changing the resource the next time you apply it (revert the default back to empty).

Is there a canonical way at runtime to communicate to the webhook facility "don't apply any defaults". This way we can ensure that if defaults are a part of populating required fields, that the source yaml, which may not have those required fields populated, will be rejected. This way a gitops-enabled source of truth can be updated to include a completely populated yaml resource, and be reused idempotently.

May 25 '23 17:05 jackfrancis

@dtzar something like this might work:

https://github.com/kubernetes-sigs/cluster-api/pull/8750

I'm not familiar enough with the mutating admission webhook flows, but this would depend upon our being able to serialize things in this way:

defaulting webhook
create webhook

The reason I say that is because (in CAPZ at least) there seems to be a sort of composition between create + default that together form a "validate new objects" flow. If we don't get the order correct, and we simply enable skipping the default flow after we've done create validation, we might be breaking our behavioral assumptions about how we're validating object input during create. In other words, there are existing defaulting webhooks that in part serve to enforce create validations (they turn empty input into valid input; if the empty values are in fact not valid, we want to validate that after the defaulting webhook flow so that in case we skip it, we are able to properly invalidate the object during admission).

Hope that makes sense.

cc @CecileRobertMichon

May 25 '23 23:05 jackfrancis

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jan 21 '24 07:01 k8s-triage-robot

I believe this to still be very valid and an important one if people want to recover their clusters 100% from git source.

Jan 22 '24 17:01 dtzar

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Feb 21 '24 17:02 k8s-triage-robot

/remove-lifecycle rotten

Feb 21 '24 18:02 dtzar

/close

By looking at the discussion above, it seems to me this isn't a Cluster API specific issue, but it is a more general API design / API machinery discussion.

Also, the output of clusterctl generate cluster is a consequence of the template published by each provider, and Cluster API (clusterctl more specifically) only defines few (lazy) guidelines.

Question like:

defaulting webhooks are capable of mutating yaml upon arrival, which can render the original (incomplete) yaml invalid if those default-populated fields also have a corresponding immutability check. (Or if they don't then, you're still in a bad state where you're changing the resource the next time you apply it (revert the default back to empty).

Should be probably addressed to API machinery folks to get a definitive answer.

My two cents: given that defaulting always runs before validating, if defaulting the original (incomplete) yaml always leads to the same value (is deterministic), then the corresponding immutability check should pass But happy to dig into a concrete example so I can learn more.

Mar 29 '24 21:03 fabriziopandini

@fabriziopandini: Closing this issue.

In response to this:

/close

By looking at the discussion above, it seems to me this isn't a Cluster API specific issue, but it is a more general API design / API machinery discussion.

Also, the output of clusterctl generate cluster is a consequence of the template published by each provider, and Cluster API (clusterctl more specifically) only defines few (lazy) guidalines.

Question like:

defaulting webhooks are capable of mutating yaml upon arrival, which can render the original (incomplete) yaml invalid if those default-populated fields also have a corresponding immutability check. (Or if they don't then, you're still in a bad state where you're changing the resource the next time you apply it (revert the default back to empty).

Should be probably addressed to API machinery folks to get a definitive answer.

My two cents: given that defaulting always runs before validating, if defaulting the original (incomplete) always leads to the same value (is deterministic), then the corresponding immutability check should pass But happy to dig into a concrete example so I can learn more.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Mar 29 '24 21:03 k8s-ci-robot