operator-sdk
Operator upgrade from 0.19 to 1.16
Bug Report
What did you do?
I have a helm operator. I have migrated the SDK from 0.19 to 1.16, and the migration seems to have gone OK: I have a functioning helm operator based on 1.16. The problem occurs when my product does an operator upgrade from the version based on 0.19 to the version based on 1.16. The issue is with a StatefulSet that has storage configured. The PVC attached to the StatefulSet (which uses the RollingUpdate strategy) sometimes gets re-created during the upgrade, causing data loss. This happens about 25% of the time; the rest of the time the upgrade succeeds and no data is lost.
The operator log contains the following:
{"level":"error","ts":1652376897.4950724,"logger":"helm.controller","msg":"Release failed","namespace":"a1","name":"ta","apiVersion":"ta.ibm.com/v2","kind":"TransAdv","release":"ta","error":"failed upgrade (update: failed to update: secrets \"sh.helm.release.v1.ta.v2\" not found) and failed rollback: release: not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\toperator-sdk/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\toperator-sdk/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
Strangely, I always see that log, whether the upgrade has been successful (i.e. without data loss) or not.
Before I do the upgrade, there is a helm release secret in the namespace: sh.helm.release.v1.ta.v1. After the upgrade it looks like the secret has been overwritten. Is that correct behavior? If so, why is the upgrade looking for a v2 secret?
If I create another version of the operator based on 1.16 and perform the upgrade, then I do see that a v2 secret is created during the upgrade (sh.helm.release.v1.ta.v2). Any ideas on why going from my pre-1.x operator to the post-1.x operator would result in the secret being overwritten instead of a new version of the secret being created? What id/key does helm use to identify a unique release? Thanks.
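For reference, Helm 3 identifies a release by its namespace plus release name, and stores each revision in its own Secret named sh.helm.release.v1.&lt;release&gt;.v&lt;revision&gt;; an upgrade should add a new revision secret rather than overwrite the previous one. A minimal sketch of the naming scheme (the inspection pipeline in the comment is the commonly used Helm 3 decode trick and assumes a live cluster; the names "ta" and "a1" are taken from the report above):

```shell
# Helm 3 release-revision secret naming: one Secret per revision.
release_secret_name() {
  # $1 = release name, $2 = revision number
  printf 'sh.helm.release.v1.%s.v%s\n' "$1" "$2"
}

release_secret_name ta 2   # prints: sh.helm.release.v1.ta.v2

# To inspect the stored release payload (requires kubectl and a cluster;
# the secret's data.release field is base64 inside base64, gzipped):
# kubectl get secret "$(release_secret_name ta 1)" -n a1 \
#   -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip
```

If the upgrade were behaving normally, you would expect both the .v1 and .v2 secrets to exist after it completes.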
What did you expect to see?
Upgrade occurs without data loss.
What did you see instead? Under which circumstances?
PVCs are destroyed and recreated during the upgrade. It's as if a reinstall is done rather than an upgrade.
Environment
Operator type: helm operator
/language helm
Kubernetes cluster type:
OpenShift 4.10.3
$ operator-sdk version
operator-sdk version: "v1.16.0", commit: "560044140c4f3d88677e4ef2872931f5bb97f255", kubernetes version: "1.21", go version: "go1.16.13", GOOS: "darwin", GOARCH: "amd64"
Possible Solution
Additional context
Are there any known issues when upgrading from an operator based on pre-1.x to an operator based on 1.x? Any thoughts on troubleshooting the issue? Thanks.
It appears that the v2 release secret gets created momentarily during the upgrade process. See the following, where we start out with v1, then the upgrade kicks off and we get a v2, and then the v2 goes away; but you can see the age on the v1 has been reset, indicating that it has been updated, as if the v2 has been overwritten into the v1:
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 9m2s
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 9m28s
a1 sh.helm.release.v1.ta.v2 helm.sh/release.v1 1 13s
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 98s
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 108s
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 2m31s
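To make that revision flip easier to track while polling, here is a small hedged helper; the sed pattern assumes Helm 3's sh.helm.release.v1.&lt;name&gt;.v&lt;N&gt; naming, and the commented kubectl line assumes the a1 namespace from the output above:

```shell
# Report the highest release revision among helm release-secret names
# read from stdin, e.g. to watch whether .v2 survives the upgrade.
max_revision() {
  sed -n 's/^sh\.helm\.release\.v1\..*\.v\([0-9][0-9]*\)$/\1/p' | sort -n | tail -1
}

printf 'sh.helm.release.v1.ta.v1\nsh.helm.release.v1.ta.v2\n' | max_revision  # prints: 2

# Against a live cluster:
# kubectl get secret -n a1 -l owner=helm -o name | sed 's|^secret/||' | max_revision
```

In the failing upgrades, this value would climb to 2 and then drop back to 1 when the v2 secret disappears.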
@varshaprasad96 could you take a look at this?
So I think I have isolated the likely cause of this problem. In the old operator (based on operator-sdk 0.19) there is a custom entrypoint. It launches the helm-operator process, but it does not use the exec command, i.e. it has this:
${OPERATOR} --watches-file=$HOME/watches.yaml $@
instead of what I believe it should be:
exec ${OPERATOR} --watches-file=$HOME/watches.yaml $@
Without the exec command, it seems that during the upgrade, the old operator continues to try to reconcile the release even after the new operator is elected leader. Both operators operating on the release may be causing the problem.
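The difference exec makes can be demonstrated without Kubernetes. Without exec, the launched process is a child of the shell and the shell stays alive as PID 1, so the SIGTERM the kubelet sends during the rollout is not forwarded and the old operator keeps running (and reconciling) until it is SIGKILLed; with exec, the operator replaces the shell, keeps its PID, and receives the signal directly. A minimal PID sketch:

```shell
# Without exec: the inner shell runs as a child, so we observe two
# distinct PIDs (the parent shell lingers alongside the child).
no_exec=$(sh -c 'echo $$; sh -c "echo \$\$"' | sort -u | wc -l | tr -d ' ')
echo "distinct PIDs without exec: $no_exec"   # 2

# With exec: the inner shell replaces the outer one, same PID, so we
# observe only one distinct PID and signals reach it directly.
with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"' | sort -u | wc -l | tr -d ' ')
echo "distinct PIDs with exec: $with_exec"    # 1
```

Applied to the entrypoint above, this is why the old operator pod can outlive its termination grace period and briefly compete with the new operator over the release.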
I'm still testing, but adding in the exec appears to resolve the issue. I haven't managed to reproduce the problem in a simple hello-world helm operator yet, though.
One option to resolve this, I think, would be to patch the old operator to add in the exec, where the patch is built using the old operator SDK.
Also, I would be interested in hearing whether there is an alternative for addressing the problem in the new operator, i.e. is there something I could do in the new operator to ensure that the old operator does not act on the release once the new one has become leader?
@geoghegk Thanks for digging in. Since this is from a very old version of the SDK which is no longer supported (0.19, pre-1.0 release), we can't release a patch. But adding documentation on this would really be helpful for someone doing an upgrade.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.