operator-sdk
Operator upgrade from 0.19 to 1.16
Bug Report
What did you do?
I have a helm operator. I have migrated the SDK from 0.19 to 1.16, and the migration seems to have gone OK: I have a functioning helm operator based on 1.16. The problem occurs when my product does an operator upgrade from the version based on 0.19 to the version based on 1.16. The issue is with a StatefulSet that has storage configured. The PVC attached to the StatefulSet (which uses the RollingUpdate strategy) sometimes gets re-created during the upgrade, causing data loss. This happens about 25% of the time; the rest of the time the upgrade succeeds and no data is lost.
The operator log contains the following:
{"level":"error","ts":1652376897.4950724,"logger":"helm.controller","msg":"Release failed","namespace":"a1","name":"ta","apiVersion":"ta.ibm.com/v2","kind":"TransAdv","release":"ta","error":"failed upgrade (update: failed to update: secrets \"sh.helm.release.v1.ta.v2\" not found) and failed rollback: release: not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\toperator-sdk/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\toperator-sdk/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
Strangely, I always see that log, whether the upgrade has been successful (i.e. without data loss) or not.
Before I do the upgrade, there is a helm release secret in the namespace: sh.helm.release.v1.ta.v1. After the upgrade it looks like the secret has been overwritten. Is that correct behavior? If so, why is the upgrade looking for a v2 secret?
If I create another version of the operator based on 1.16 and perform the upgrade, then I do see that a v2 secret is created during the upgrade (sh.helm.release.v1.ta.v2). Any ideas on why going from my pre-1.x operator to the post-1.x operator would result in the secret being overwritten instead of a new version of the secret being created? What id/key does helm use to identify a unique release? Thanks.
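For reference, Helm 3 identifies a release by its namespace plus release name, and stores each revision in its own Secret named sh.helm.release.v1.&lt;release&gt;.v&lt;revision&gt;; an upgrade should add a new revision secret rather than overwrite the previous one. A minimal sketch of the naming scheme (the inspection pipeline in the comment is the commonly used Helm 3 decode trick and assumes a live cluster; the names "ta" and "a1" are taken from the report above):

```shell
# Helm 3 release-revision secret naming: one Secret per revision.
release_secret_name() {
  # $1 = release name, $2 = revision number
  printf 'sh.helm.release.v1.%s.v%s\n' "$1" "$2"
}

release_secret_name ta 2   # prints: sh.helm.release.v1.ta.v2

# To inspect the stored release payload (requires kubectl and a cluster;
# the secret's data.release field is base64 inside base64, gzipped):
# kubectl get secret "$(release_secret_name ta 1)" -n a1 \
#   -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip
```

If the upgrade were behaving normally, you would expect both the .v1 and .v2 secrets to exist after it completes.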
What did you expect to see?
Upgrade occurs without data loss.
What did you see instead? Under which circumstances?
PVCs are destroyed and recreated during the upgrade. It's as if a reinstall is done rather than an upgrade.
Environment
Operator type: helm operator
/language helm
Kubernetes cluster type:
OpenShift 4.10.3
$ operator-sdk version
operator-sdk version: "v1.16.0", commit: "560044140c4f3d88677e4ef2872931f5bb97f255", kubernetes version: "1.21", go version: "go1.16.13", GOOS: "darwin", GOARCH: "amd64"
Possible Solution
Additional context
Are there any known issues when upgrading from an operator based on pre-1.x to an operator based on 1.x? Any thoughts on troubleshooting the issue? Thanks.
It appears that the v2 release secret gets created momentarily during the upgrade process. See the following, where we start out with v1, then the upgrade kicks off and we get a v2, and then the v2 goes away; but you can see the age on the v1 has been reset, indicating that it has been updated, as if the v2 has been overwritten into the v1:
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 9m2s
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 9m28s
a1 sh.helm.release.v1.ta.v2 helm.sh/release.v1 1 13s
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 98s
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 108s
> kubectl get secret --all-namespaces -l "owner=helm"
NAMESPACE NAME TYPE DATA AGE
a1 sh.helm.release.v1.ta.v1 helm.sh/release.v1 1 2m31s
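To make that revision flip easier to track while polling, here is a small hedged helper; the sed pattern assumes Helm 3's sh.helm.release.v1.&lt;name&gt;.v&lt;N&gt; naming, and the commented kubectl line assumes the a1 namespace from the output above:

```shell
# Report the highest release revision among helm release-secret names
# read from stdin, e.g. to watch whether .v2 survives the upgrade.
max_revision() {
  sed -n 's/^sh\.helm\.release\.v1\..*\.v\([0-9][0-9]*\)$/\1/p' | sort -n | tail -1
}

printf 'sh.helm.release.v1.ta.v1\nsh.helm.release.v1.ta.v2\n' | max_revision  # prints: 2

# Against a live cluster:
# kubectl get secret -n a1 -l owner=helm -o name | sed 's|^secret/||' | max_revision
```

In the failing upgrades, this value would climb to 2 and then drop back to 1 when the v2 secret disappears.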
@varshaprasad96 could you take a look at this?
So I think I have isolated the likely cause of this problem. In the old operator (based on operator-sdk 0.19) there is a custom entrypoint. It launches the helm-operator process, but it does not use the exec command, i.e. it has this:
${OPERATOR} --watches-file=$HOME/watches.yaml $@
instead of what I believe it should be:
exec ${OPERATOR} --watches-file=$HOME/watches.yaml $@
Without the exec command, it seems that during the upgrade, the old operator continues to try to reconcile the release even after the new operator is elected leader. Both operators operating on the release may be causing the problem.
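The difference exec makes can be demonstrated without Kubernetes. Without exec, the launched process is a child of the shell and the shell stays alive as PID 1, so the SIGTERM the kubelet sends during the rollout is not forwarded and the old operator keeps running (and reconciling) until it is SIGKILLed; with exec, the operator replaces the shell, keeps its PID, and receives the signal directly. A minimal PID sketch:

```shell
# Without exec: the inner shell runs as a child, so we observe two
# distinct PIDs (the parent shell lingers alongside the child).
no_exec=$(sh -c 'echo $$; sh -c "echo \$\$"' | sort -u | wc -l | tr -d ' ')
echo "distinct PIDs without exec: $no_exec"   # 2

# With exec: the inner shell replaces the outer one, same PID, so we
# observe only one distinct PID and signals reach it directly.
with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"' | sort -u | wc -l | tr -d ' ')
echo "distinct PIDs with exec: $with_exec"    # 1
```

Applied to the entrypoint above, this is why the old operator pod can outlive its termination grace period and briefly compete with the new operator over the release.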
I'm still testing, but adding in the exec appears to resolve the issue. I haven't managed to reproduce the problem in a simple hello-world helm operator yet, though.
One option to resolve this, I think, would be to patch the old operator to add in the exec, where the patch is built using the old operator SDK.
Also, I would be interested in hearing whether there is an alternative for addressing the problem in the new operator, i.e. is there something I could do in the new operator to ensure that the old operator does not act on the release once the new one has become leader?
@geoghegk Thanks for digging in. Since this is from a very old version of the SDK which is no longer supported (0.19, pre-1.0 release), we can't release a patch. But adding documentation on this would really be helpful for someone doing an upgrade.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.