🐛 Improve handling of topology orphaned objects

Open fabriziopandini opened this issue 1 year ago • 5 comments

What this PR does / why we need it: This PR fixes https://github.com/kubernetes-sigs/cluster-api/issues/10275 and improves how the topology controller handles referenced objects when errors occur. Specifically:

  • if InfrastructureCluster is created, but ControlPlane creation fails, InfrastructureCluster is tracked (the issue)
  • if infrastructureMachineTemplate is created, but ControlPlane creation fails, infrastructureMachineTemplate is cleaned up
  • if infrastructureMachineTemplate is created, but an error happens before MD is created, infrastructureMachineTemplate is cleaned up
  • if bootstrapTemplate is created, but an error happens before MD is created, bootstrapTemplate is cleaned up
  • if infrastructureMP is created, but an error happens before MP is created, infrastructureMP is cleaned up
  • if bootstrapConfig is created, but an error happens before MP is created, bootstrapConfig is cleaned up
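
For illustration only (this is not code from the PR): the best-effort cleanup behavior listed above can be sketched as below. The `store` type, its `create`/`delete` methods, and `createMachineDeployment` are hypothetical stand-ins for the API server and the reconcile logic, and the MachineDeployment failure is simulated.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for the API server.
type store struct {
	objects map[string]bool
	failOn  string // name whose creation is made to fail, for demonstration
}

func (s *store) create(name string) error {
	if name == s.failOn {
		return fmt.Errorf("creating %s: simulated failure", name)
	}
	s.objects[name] = true
	return nil
}

func (s *store) delete(name string) error {
	delete(s.objects, name)
	return nil
}

// createMachineDeployment creates the referenced templates first and the
// MachineDeployment last. If any step fails, the templates created so far
// are deleted on a best-effort basis so they are not left orphaned.
func createMachineDeployment(s *store) (retErr error) {
	var created []string
	defer func() {
		if retErr == nil {
			return
		}
		for _, name := range created {
			if err := s.delete(name); err != nil {
				// Keep the original error and attach the cleanup error.
				retErr = errors.Join(retErr, err)
			}
		}
	}()

	for _, name := range []string{"infrastructureMachineTemplate", "bootstrapTemplate"} {
		if err := s.create(name); err != nil {
			return err
		}
		created = append(created, name)
	}
	return s.create("machineDeployment")
}

func main() {
	s := &store{objects: map[string]bool{}, failOn: "machineDeployment"}
	err := createMachineDeployment(s)
	fmt.Println("error:", err)
	fmt.Println("objects left behind:", len(s.objects))
}
```

The deferred function only cleans up when the function returns an error, so a successful run keeps both templates and the MachineDeployment.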

I will keep the PR in WIP while I run some additional tests.

Which issue(s) this PR fixes: Fixes https://github.com/kubernetes-sigs/cluster-api/issues/10275

/area clusterclass

/cc @sbueringer @chrischdi

fabriziopandini avatar Mar 18 '24 20:03 fabriziopandini

/test pull-cluster-api-e2e-main

fabriziopandini avatar Mar 18 '24 20:03 fabriziopandini

@fabriziopandini : would it be handy to perhaps use controllerutil.OperationResult instead of bools (which can maybe be helpful for other things down the line?)

mnaser avatar Mar 18 '24 20:03 mnaser

would it be handy to perhaps use controllerutil.OperationResult instead of bools (which can maybe be helpful for other things down the line?)

controllerutil.OperationResult is not an exact match for reconcileReferencedTemplate, because one possible outcome is a template rotation.

Also, this is an internal API; we can refactor it again if we need more down the line.
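
For illustration only (not code from this PR): controller-runtime's controllerutil.OperationResult covers the values unchanged, created, updated, updatedStatus and updatedStatusOnly, none of which describes a rotation. A minimal sketch of what a dedicated outcome type could look like follows. The `templateOutcome` type, its constants, and `createdNewObject` are all hypothetical, and the assumption that a rotation leaves a new object behind is mine, not something stated in the PR.

```go
package main

import "fmt"

// templateOutcome is a hypothetical local type; it is not part of
// cluster-api or controller-runtime.
type templateOutcome string

const (
	outcomeUnchanged templateOutcome = "unchanged"
	outcomeCreated   templateOutcome = "created"
	outcomeUpdated   templateOutcome = "updated"
	// outcomeRotated has no counterpart in controllerutil.OperationResult.
	outcomeRotated templateOutcome = "rotated"
)

// createdNewObject reports whether the outcome left a brand-new object
// behind, i.e. one that might need cleanup if a later step fails.
// Treating a rotation as creating a new object is an assumption.
func createdNewObject(o templateOutcome) bool {
	return o == outcomeCreated || o == outcomeRotated
}

func main() {
	outcomes := []templateOutcome{outcomeUnchanged, outcomeCreated, outcomeUpdated, outcomeRotated}
	for _, o := range outcomes {
		fmt.Printf("%-9s new object: %t\n", o, createdNewObject(o))
	}
}
```

A single bool answers the same question as `createdNewObject` with less surface area, which may be why bools are enough for an internal API.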

fabriziopandini avatar Mar 18 '24 21:03 fabriziopandini

I reproduced the error with CAPD; with the fix, no duplicate InfrastructureClusters are created. I also verified that once the ClusterClass patch is fixed, cluster provisioning resumes as expected.

fabriziopandini avatar Mar 19 '24 12:03 fabriziopandini

LGTM label has been added.

Git tree hash: 72bfbb3466262c0662d681ee36619380c004577d

k8s-ci-robot avatar Mar 26 '24 15:03 k8s-ci-robot

Thx!

/lgtm /approve

sbueringer avatar Mar 27 '24 05:03 sbueringer

LGTM label has been added.

Git tree hash: ecf3b2db4dda4af199608c9a2863001aab09baea

k8s-ci-robot avatar Mar 27 '24 05:03 k8s-ci-robot

Let's cherry-pick (at least into 1.6)

/cherry-pick release-1.6

sbueringer avatar Mar 27 '24 05:03 sbueringer

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.6 in a new PR and assign it to you.

In response to this:

Let's cherry-pick (at least into 1.6)

/cherry-pick release-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/cherry-pick release-1.5

sbueringer avatar Mar 27 '24 05:03 sbueringer

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.5

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

k8s-ci-robot avatar Mar 27 '24 05:03 k8s-ci-robot

@sbueringer: #10277 failed to apply on top of branch "release-1.6":

Applying: Avoid leaving orphaned InfrastructureCluster when create control plane fails
Applying: Best effort cleanup of referenced templates/objects
Using index info to reconstruct a base tree...
M	internal/controllers/topology/cluster/desired_state.go
M	internal/controllers/topology/cluster/desired_state_test.go
M	internal/controllers/topology/cluster/reconcile_state.go
M	internal/controllers/topology/cluster/reconcile_state_test.go
Falling back to patching base and 3-way merge...
Auto-merging internal/controllers/topology/cluster/reconcile_state_test.go
Auto-merging internal/controllers/topology/cluster/reconcile_state.go
CONFLICT (content): Merge conflict in internal/controllers/topology/cluster/reconcile_state.go
Auto-merging internal/controllers/topology/cluster/desired_state_test.go
Auto-merging internal/controllers/topology/cluster/desired_state.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Best effort cleanup of referenced templates/objects
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

Let's cherry-pick (at least into 1.6)

/cherry-pick release-1.6

@sbueringer: #10277 failed to apply on top of branch "release-1.5":

Applying: Avoid leaving orphaned InfrastructureCluster when create control plane fails
Using index info to reconstruct a base tree...
M	internal/controllers/topology/cluster/reconcile_state.go
M	internal/controllers/topology/cluster/reconcile_state_test.go
Falling back to patching base and 3-way merge...
Auto-merging internal/controllers/topology/cluster/reconcile_state_test.go
Auto-merging internal/controllers/topology/cluster/reconcile_state.go
CONFLICT (content): Merge conflict in internal/controllers/topology/cluster/reconcile_state.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Avoid leaving orphaned InfrastructureCluster when create control plane fails
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.5

@fabriziopandini We probably should cherry-pick manually

sbueringer avatar Mar 27 '24 06:03 sbueringer