kraan
Moving addons to higher priority layers doesn't prune from the lower layer
Describe the bug When moving an addon from one layer to another, the reconciliation doesn't complete successfully because the old resources are not pruned.
To Reproduce Steps to reproduce the behavior:
- Deploy two layers, one dependent on the other, e.g. app-layer depends on base-layer. Let's say app-layer has two addons: app-runtime and app-logging. base-layer contains at least one addon: base-addon.
- Move app-logging from app-layer to base-layer, renaming it base-logging. Update layer versions, git-repository source, etc.
- Apply the new AddonsLayers manifest files.
- See error: base-logging cannot be installed because its resources already exist (e.g. the deployment already exists) and they are not managed by this Helm release.
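As a rough sketch, the move described in the steps above amounts to something like the following AddonsLayer fragments. The apiVersion and field names here are from memory and should be checked against your kraan version; the layer names, versions, sources, and paths come from this report:

```yaml
# Before the move (sketch): app-layer carries the app-logging HelmRelease
# in its source path and depends on base-layer.
apiVersion: kraan.io/v1alpha1
kind: AddonsLayer
metadata:
  name: app-layer
spec:
  version: 1.1.1
  prereqs:
    dependsOn:
      - base-layer
  source:
    name: app-layer-git-location
    path: ./add-ons/helm-releases
---
# After the move (sketch): base-layer is bumped to 0.6.0 and its source
# path now contains the HelmRelease renamed from app-logging to base-logging.
apiVersion: kraan.io/v1alpha1
kind: AddonsLayer
metadata:
  name: base-layer
spec:
  version: 0.6.0
  source:
    name: base-layer-git-location
    path: ./add-ons/helm-releases
```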
Expected behavior When I remove an addon from a layer, it will be pruned regardless of the dependency relationship to other layers.
Kraan Helm Chart Version = v0.2.8 Kubernetes Version = v1.19.9
@avacaru The change of name of the HelmRelease custom resource will prevent orphan/adoption processing. However, I'd expect the app-logging resources to be deleted when the app-logging HelmRelease is deleted by prune processing, so that base-logging can be deployed. Initially, though, app-layer pruning may still be in progress while base-layer (which has nothing to prune) proceeds to the apply phase; in that window the base-logging HelmRelease will fail, but it should be retried. Kraan just deploys the HelmReleases and relies on a failed HelmRelease being retried. Could you check the retry settings?
@paulcarlton-ww This is what I have in that HR's manifest:
spec:
  install:
    remediation:
      retries: -1
  upgrade:
    remediation:
      retries: -1
I was expecting the same thing; even after the initial failure I was hoping it would be fixed in the next reconciliation cycle.
Also, I can see this error in the logs of kraan-controller:
Operation cannot be fulfilled on addonslayers.kraan.io "app-layer": the object has been modified; please apply your changes to the latest version and try again
controllers.(*AddonsLayerReconciler).update - addons_controller.go(896) - failed to update
github.com/fidelity/kraan/controllers.(*AddonsLayerReconciler).update
/workspace/controllers/addons_controller.go:896
github.com/fidelity/kraan/controllers.(*AddonsLayerReconciler).updateRequeue
/workspace/controllers/addons_controller.go:702
github.com/fidelity/kraan/controllers.(*AddonsLayerReconciler).Reconcile
/workspace/controllers/addons_controller.go:837
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1373
controllers.(*AddonsLayerReconciler).updateRequeue - addons_controller.go(703) - failed to update
Does that help?
The kraan log error is normal: this is the controller runtime encountering a resource-version mismatch due to concurrent reconciliation of a layer. The standard practice in Kubernetes is to redo the reconcile using the latest version of the layer, after which it should be OK, so Kraan schedules an immediate re-reconcile. I'd only be concerned if copious numbers of this error were being generated?
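The retry-on-conflict pattern described above (re-read the latest object and redo the update when the resource version is stale) can be sketched in plain Go. This is a self-contained simulation, not kraan's actual code: `fakeStore` is a stand-in for the API server, and the deliberate version bump simulates a concurrent reconcile:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict mimics the "object has been modified" error from the API server.
var errConflict = errors.New("object has been modified; please apply your changes to the latest version and try again")

// fakeStore stands in for the API server: one object with a resourceVersion.
type fakeStore struct {
	version int
	status  string
}

func (s *fakeStore) get() int { return s.version }

// update succeeds only if the caller read the latest resourceVersion.
func (s *fakeStore) update(readVersion int, status string) error {
	if readVersion != s.version {
		return errConflict
	}
	s.version++
	s.status = status
	return nil
}

// updateWithRetry is the standard reconcile fix: on conflict, re-read the
// latest version and retry the update instead of treating it as a failure.
func updateWithRetry(s *fakeStore, status string, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		v := s.get()
		if i == 0 {
			// Simulate a concurrent reconcile bumping the version under us.
			s.version++
		}
		if err := s.update(v, status); err != nil {
			if errors.Is(err, errConflict) {
				continue // stale read: loop re-reads the latest version
			}
			return err
		}
		return nil
	}
	return fmt.Errorf("still conflicting after %d retries", maxRetries)
}

func main() {
	s := &fakeStore{version: 1, status: "Pending"}
	if err := updateWithRetry(s, "Deployed", 5); err != nil {
		fmt.Println("update failed:", err)
		return
	}
	fmt.Println("status:", s.status) // prints "status: Deployed"
}
```

In a real controller this is what client-go's conflict-retry helpers do for you; the point is only that a single "object has been modified" log line per clash is expected behaviour.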
The retry spec is correct. I'm wondering if this is a helm/helm-controller issue? If you restart the helm-controller or delete the HelmRelease, does that fix it?
There is indeed a high number of these errors, approximately one every minute (the AddonsLayer interval setting).
I've tried to delete the HR manually and then all the layers end up being successfully deployed. I haven't tried yet to restart the helm-controller, but I'll give it a go now.
One a minute suggests the periodic sync of all layers, i.e. the controller reconciles all layers every minute. I expect this clashes with the repeated attempts to process the layer containing the failing base-logging HelmRelease.
I think this is a helm-controller/helm issue. All Kraan is responsible for is applying changes to HelmRelease objects; it relies on the helm-controller applying that change. The fact that deleting the HR fixes the issue suggests that something is not working right in the helm-controller, possibly due to some nuance of the way helm works.
I've checked the helm-controller logs and I can't see any errors about the initial app-logging HelmRelease failing to be deleted. All I see is that the reconciliation has succeeded. I am wondering, is the kraan-controller ever deleting the app-logging HR?
Any other logs I can look into in order to track this issue down?
Check the kraan log; depending on the verbosity setting you should see the app-logging HR being deleted, but it's easier to just check using kubectl: the HR should be gone, and all the resources it created should have been deleted too. Check the deployment.apps object; it should have been deleted. If it is still there with an owner annotation indicating that it belongs to the app-logging HR, then that is the cause of the issue, but I assumed from your original description that this deletion had occurred.
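The checks described above can be run directly against the cluster. These commands are illustrative (the namespace here is the my-namespace placeholder from this thread); the second one surfaces Helm's release-name annotation, which is what identifies the owning release:

```
# The old app-logging HelmRelease should be gone:
kubectl get hr -n my-namespace

# Any surviving workloads: check which Helm release still claims them.
kubectl get deploy -n my-namespace \
  -o custom-columns='NAME:.metadata.name,RELEASE:.metadata.annotations.meta\.helm\.sh/release-name'
```

If a deployment still shows `app-logging` in the RELEASE column after the layer move, prune has not happened and the base-logging install will keep failing.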
From a Helm-Controller maintainer... The helm-controller does indeed not garbage collect if you change whatever release it is pointed at. The other mention, about the "already modified" error, is something I have been working on across controllers. There is a patch helper pending in https://github.com/fluxcd/pkg/pull/101 that knows how to deal with conflicts; integration of this is bundled with some other changes and will take some time before it is available.
@avacaru Please raise a Helm-Controller issue at https://github.com/fluxcd/helm-controller/issues
Check the kraan log; depending on the verbosity setting you should see the app-logging HR being deleted, but it's easier to just check using kubectl: the HR should be gone, and all the resources it created should have been deleted too. Check the deployment.apps object; it should have been deleted. If it is still there with an owner annotation indicating that it belongs to the app-logging HR, then that is the cause of the issue, but I assumed from your original description that this deletion had occurred.
That is the actual problem: the old HR does not get removed. I can still see it when I run kubectl get hr. That's why the new HR, in the new layer, can't get deployed: the resources with the same name already exist.
Helm install failed: rendered manifests contain a resource that already exists. Unable to continue with install: ServiceAccount "logging-sa" in namespace "my-namespace" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "base-logging": current value is "app-logging"
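For reference, Helm 3 decides ownership from metadata it stamps on every resource it creates, which is exactly what the error above is checking. The orphaned ServiceAccount would look roughly like this (name and namespace taken from the error message):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: logging-sa
  namespace: my-namespace
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    # Helm refuses to adopt this resource into the base-logging release
    # because these annotations still point at the old release.
    meta.helm.sh/release-name: app-logging
    meta.helm.sh/release-namespace: my-namespace
```

Until the app-logging release (and the resources it owns) is deleted, every install attempt for base-logging will fail this ownership validation.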
Also, if I remove the old HR manually, the reconciliation succeeds and all layers reach status Deployed. Without that, they are stuck like this:
NAME VERSION SOURCE PATH STATUS MESSAGE
base-layer 0.6.0 base-layer-git-location ./add-ons/helm-releases Failed AddonsLayer failed, HelmRelease: my-namespace/base-logging, not ready
app-layer 1.1.1 app-layer-git-location ./add-ons/helm-releases ApplyPending Waiting for layer: base-layer, to apply source revision: 0.6.0/523d71d14a89cd461b63914055e5e1d593e9098d. Layer: base-layer, current state: Failed, deployed revision: 0.5.0/22b5c459bfd54a50e7d6b90e05c16c331b50040c.
Can I get someone to reproduce this issue before I open an issue in the helm-controller, please?
@avacaru The app-logging HR should have been pruned by the Kraan controller; please post the kraan logs here. If you need to re-run to recreate it, set the log level to 4 for copious logging, thanks.
@paulcarlton-ww I've set the log level to 4 and reproduced the bug. I can't see any errors in the logs; the only reference to pruning is in these lines:
{"level":"Level(-4)","ts":"2021-06-10T14:52:47.194Z","logger":"kraan.controller.reconciler","msg":"Entering function","function":"controllers.(*AddonsLayerReconciler).processPrune","source":"addons_controller.go","line":384}
{"level":"Level(-4)","ts":"2021-06-10T14:52:47.194Z","logger":"kraan.controller.reconciler","msg":"Entering function","layer":"app-layer","function":"apply.KubectlLayerApplier.PruneIsRequired","source":"layerApplier.go","line":802}
No other reference that would indicate the layer has been pruned of addons no longer present in the git source.
Can you post the complete log?
Can you post the complete log?
Unfortunately not, but if you don't have an environment to reproduce this, what should I look for in the logs? Any particular string I can search for that demonstrates the app-logging HR has been removed by the kraan-controller?
@avacaru I will recreate this myself when I get time but that might not be for a few days
@avacaru I've tested this using the head of master code and it seems to work fine in both directions. Can you share the HR definitions for base-logging and app-logging? Are the images being used public?