Generated YAML definitions for clusters are idempotent
What would you like to be added (User Story)?
As a cluster operator, I'd like the YAML created by clusterctl generate cluster to be idempotent so that I can easily ensure the cluster infrastructure as code stored in git matches what is deployed in the real environment.
Detailed Description
As a cluster operator, I'd like the YAML created by clusterctl generate cluster to be idempotent so that I can ensure the cluster infrastructure as code stored in git matches what is deployed in the real environment.
Today this may or may not work, depending on whether anything has changed on the cluster. When it doesn't work, you get back an error message, and it is not clear why, since it may be the exact same cluster definition that was originally applied.
This problem is exacerbated when you're using CAPI + GitOps in a reconciliation loop. It is cumbersome and tedious to take a backup of the cluster after the initially generated manifest has been applied and then extract the raw Kubernetes YAML you need, when re-applying the generated manifest itself is what should be supported in an idempotent way.
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature
This issue is currently awaiting triage.
If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In which way are the YAMLs not idempotent today? Can you provide a concrete example?
Also, please provide the error you are referring to in the issue description
/triage needs-information
I have a CAPZ cluster definition applied against a Kubernetes-on-Docker-Desktop management cluster. I then reset the management cluster, installed the exact same version of the management cluster components and secrets, re-applied the exact same definition, and got these errors:
```
cluster.cluster.x-k8s.io/ascapz unchanged
machinepool.cluster.x-k8s.io/pool0 created
machinepool.cluster.x-k8s.io/pool1 created
azureclusteridentity.infrastructure.cluster.x-k8s.io/asc-cluster-identity unchanged
Error from server (InternalError): error when creating "cluster.yaml": Internal error occurred: failed calling webhook "default.azuremanagedcontrolplanes.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capz-webhook-service.capz-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-azuremanagedcontrolplane?timeout=10s": dial tcp 10.104.156.200:443: connect: connection refused
Error from server (InternalError): error when creating "cluster.yaml": Internal error occurred: failed calling webhook "validation.azuremanagedclusters.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capz-webhook-service.capz-system.svc:443/validate-infrastructure-cluster-x-k8s-io-v1beta1-azuremanagedcluster?timeout=10s": dial tcp 10.104.156.200:443: connect: connection refused
Error from server (InternalError): error when creating "cluster.yaml": Internal error occurred: failed calling webhook "default.azuremanagedmachinepools.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capz-webhook-service.capz-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-azuremanagedmachinepool?timeout=10s": dial tcp 10.104.156.200:443: connect: connection refused
Error from server (InternalError): error when creating "cluster.yaml": Internal error occurred: failed calling webhook "default.azuremanagedmachinepools.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capz-webhook-service.capz-system.svc:443/mutate-infrastructure-cluster-x-k8s-io-v1beta1-azuremanagedmachinepool?timeout=10s": dial tcp 10.104.156.200:443: connect: connection refused
```
Looks to me like the CAPZ controller is not up/reachable
I don't see a correlation with the YAML we generate
I agree with @sbueringer that the output above simply looks like the webhooks were unavailable at that time.
That said, in terms of idempotency, defaulting webhooks are capable of mutating YAML upon arrival, which can render the original (incomplete) YAML invalid if those default-populated fields also have a corresponding immutability check. (And if they don't, you're still in a bad state where re-applying the original YAML changes the resource by reverting the default back to empty.)
Is there a canonical way at runtime to tell the webhook facility "don't apply any defaults"? That way we could ensure that, where defaults are responsible for populating required fields, source YAML that does not have those fields populated would be rejected. A GitOps-enabled source of truth could then be updated to contain a completely populated YAML resource and be reused idempotently.
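To make that concrete, here is a minimal Go sketch of the pattern (hypothetical resource and field names, not actual CAPZ types): the authored YAML omits a field, the defaulting webhook populates it on arrival, and the object that ends up stored is therefore not the one in git.

```go
package main

import "fmt"

// hypotheticalPoolSpec stands in for a provider resource with a field that is
// optional in the authored YAML, defaulted server-side, and then treated as
// immutable on update.
type hypotheticalPoolSpec struct {
	OSDiskSizeGB int
}

// defaultPool mimics the mutating (defaulting) webhook that runs on arrival.
func defaultPool(s *hypotheticalPoolSpec) {
	if s.OSDiskSizeGB == 0 {
		s.OSDiskSizeGB = 128 // assumed default value, for illustration only
	}
}

func main() {
	// What is committed to git: the field is omitted.
	authored := hypotheticalPoolSpec{}

	// What the API server actually stores after the defaulting webhook runs.
	stored := authored
	defaultPool(&stored)

	// The YAML in git and the stored object now differ. Whether re-applying
	// the authored YAML is accepted then hinges on how the re-defaulted value
	// compares against the stored one under the immutability check.
	fmt.Printf("authored (git): %+v\n", authored)
	fmt.Printf("stored:         %+v\n", stored)
}
```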
@dtzar something like this might work:
https://github.com/kubernetes-sigs/cluster-api/pull/8750
I'm not familiar enough with the mutating admission webhook flows, but this would depend on our being able to sequence things in this order:
- defaulting webhook
- create webhook
The reason I say that is because (in CAPZ at least) there seems to be a sort of composition between create validation and defaulting that together forms a "validate new objects" flow. If we don't get the order correct and simply enable skipping the defaulting flow after we've done create validation, we might break our behavioral assumptions about how we validate object input during create. In other words, there are existing defaulting webhooks that in part serve to enforce create validations: they turn empty input into valid input. If the empty values are in fact not valid, we want to validate that after the defaulting webhook flow, so that if we skip it, we are still able to properly reject the object during admission.
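To illustrate that ordering concern, here is a minimal sketch in plain Go (simplified stand-ins, not the real controller-runtime defaulting/validation interfaces): the defaulting step is what turns empty input into input that create validation accepts, so skipping it changes what gets rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// spec is a hypothetical resource whose Tier field must be non-empty by the
// time create validation runs; today the defaulting webhook guarantees that.
type spec struct {
	Tier string
}

// defaultSpec plays the role of the mutating (defaulting) webhook.
func defaultSpec(s *spec) {
	if s.Tier == "" {
		s.Tier = "Free" // assumed default, for illustration only
	}
}

// validateCreate plays the role of the validating webhook. It assumes
// defaulting has already run, so an empty Tier is treated as invalid.
func validateCreate(s spec) error {
	if s.Tier == "" {
		return errors.New("spec.tier must be set")
	}
	return nil
}

// admitCreate mirrors the API server's ordering: mutating admission
// (defaulting) runs before validating admission.
func admitCreate(s spec, skipDefaulting bool) error {
	if !skipDefaulting {
		defaultSpec(&s)
	}
	return validateCreate(s)
}

func main() {
	incomplete := spec{} // Tier omitted in the source YAML

	// Normal flow: defaulting fills Tier, then validation passes.
	fmt.Println("with defaulting:", admitCreate(incomplete, false))

	// If defaulting could be skipped, validation must be strict enough to
	// reject the incomplete object on its own.
	fmt.Println("skip defaulting:", admitCreate(incomplete, true))
}
```

In other words, allowing the defaulting step to be skipped is only safe if create validation is strict enough to reject the un-defaulted input by itself, which is the behavioral assumption described above.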
Hope that makes sense.
cc @CecileRobertMichon
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I believe this is still very valid and important if people want to be able to recover their clusters 100% from git source.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/close
Looking at the discussion above, it seems to me this isn't a Cluster API-specific issue, but rather a more general API design / API machinery discussion.
Also, the output of clusterctl generate cluster is a consequence of the templates published by each provider, and Cluster API (clusterctl more specifically) only defines a few loose guidelines.
Questions like:
defaulting webhooks are capable of mutating YAML upon arrival, which can render the original (incomplete) YAML invalid if those default-populated fields also have a corresponding immutability check. (And if they don't, you're still in a bad state where re-applying the original YAML changes the resource by reverting the default back to empty.)
should probably be addressed to the API machinery folks to get a definitive answer.
My two cents: given that defaulting always runs before validating, if defaulting the original (incomplete) YAML always leads to the same value (i.e. is deterministic), then the corresponding immutability check should pass. But I'm happy to dig into a concrete example so I can learn more.
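Here is a tiny sketch of that argument (hypothetical field and default values, same caveats as the earlier snippets): because defaulting runs before validating, the re-applied incomplete YAML is re-defaulted before the immutability comparison, so a deterministic default yields the same value and the check passes.

```go
package main

import (
	"errors"
	"fmt"
)

type spec struct {
	SKU string // omitted in the authored YAML, defaulted server-side
}

// defaulter returns a defaulting function that always fills the same value,
// i.e. a deterministic default.
func defaulter(value string) func(*spec) {
	return func(s *spec) {
		if s.SKU == "" {
			s.SKU = value
		}
	}
}

// validateUpdate mimics an immutability check on the defaulted field.
func validateUpdate(stored, incoming spec) error {
	if stored.SKU != incoming.SKU {
		return errors.New("spec.sku is immutable")
	}
	return nil
}

func main() {
	authored := spec{} // what sits in git, SKU omitted

	// First apply: defaulting runs, and the defaulted object is stored.
	applyDefault := defaulter("Standard_D2s_v3") // assumed default value
	stored := authored
	applyDefault(&stored)

	// Re-apply of the same authored YAML: defaulting runs again before
	// validation, yields the same value, and the immutability check passes.
	incoming := authored
	applyDefault(&incoming)
	fmt.Println("deterministic default:", validateUpdate(stored, incoming))

	// Only if defaulting produced a different value than on the first apply
	// would the check reject the re-applied YAML (purely illustrative).
	changedDefault := defaulter("Standard_D4s_v3")
	incoming2 := authored
	changedDefault(&incoming2)
	fmt.Println("changed default:      ", validateUpdate(stored, incoming2))
}
```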
@fabriziopandini: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.