Upgrade to 1.6.1 prevents applying of Serving and Eventing CRs using server-side apply
After upgrading to Operator 1.6.1, some Serving and Eventing CRs will no longer apply using server-side apply. This appears to happen on clusters that at one point had pre v1beta1 CRs.
Repro Steps
- Install Operator
1.2.2 - Apply a Serving or Eventing CR using
v1alpha1using a server-side apply (e.g.kubectl apply --server-side=true) - Upgrade to Operator version
1.5.3(note that this repros upgrading with1.3.2and1.4.1as intermediate steps as well) - Update CRs to use
v1beta1using server-side apply - Upgrade to Operator version
1.6.1 - Perform a server-side apply on any CR (no changes necessary)
Expected
Server-side apply succeeds.
Actual
Server-side apply fails with the following error:
Error from server: request to convert CR to an invalid group/version: operator.knative.dev/v1alpha1
Additional Information
This was reproduced on live clusters have have used the Operator since at least release v1.0. This appears to be due to a lingering metadata.managedFields entry that references operator.knative.dev/v1alpha1. You can find a script that easily reproduces this issue in this repo.
https://github.com/tshak/knoperator-v1beta1-ssa-repro/pull/1 This is the change i requested: upgrade operator incrementally one minor version by one minor version
Give more time like 30s between each installation and before applying the v1beta1 CR. @tshak
I tested it with docker-desktop, and it worked.
Regarding the managedFields, I am not sure whether it can resolve your issue, by waiting for longer time before applying the v1beta1 CR.
I do not think that it's a timing issue as our automation in our live cluster retries applying the CR and was failing many hours after the Operator upgrade to v1.6.1. For the synthetic repro I ran your PR and I'm still seeing the issue in a brand new Kind cluster.
@evankanderson @dprotaso
Here is what we can possibly do with the migrateresource function in knative/pkg. Now we are doing
_, err := client.Namespace(item.GetNamespace()).
Patch(ctx, item.GetName(), types.MergePatchType, []byte("{}"), metav1.PatchOptions{})
to migrate the existing CRs from old to new version. Can we change it into:
_, err := client.Namespace(item.GetNamespace()).
Patch(ctx, item.GetName(), types.MergePatchType, []byte(`{"metadata":{"managedFields": [{}]}}`), metav1.PatchOptions{})
to clear the managedField in order to solve this issue as described for managedField with old version, blocking the installation of the CR at the new version?
This should solve the issue. I'm not sure if there are any unintended consequences of deleting all managedField entries. The best solution is probably to remove only operator.knative.dev/v1alpha1. I'm not very opinionated either way though.
My guess is the post-install job isn't being run after installing 1.5.3
https://github.com/knative/operator/blob/knative-v1.5.3/config/post-install/storage-version-migrator.yaml
This should migrate the CRD to a new storage version
@dprotaso I actually moved the implementation into the operator controller itself in 1.5.3. So we do not need to run the migration job, the operator will do it automatically, when it is launched. It is equivalent to run the post-install job. What I am proposing here is to "clear the managedField" or remove only operator.knative.dev/v1alpha1 as @tshak said.
It is correct that we did not run the post-install job. My contention is that this goes against the biggest value proposition of the Operator pattern: to automate resource lifecycle issues such as this one.
This seems like a k8s bug - let me see if I can repro with a trivial example
I agree that it's probably a k8s bug. If a managedField entry has an invalid apiVersion it should not prevent a server-side apply. There may be another bug around API conversions missing the managedField entry in the first place. Regardless, I still think that the Operator should still perform either of the workarounds proposed by @houshengbo.
@tshak Another way without touching knative/pkg could be only accessing serving and eventing CRs in knative operator and changing the managedField for both of them, if they exist.
Created the upstream issue based on this slack discussion
https://github.com/kubernetes/kubernetes/issues/111937
This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.
This issue or pull request is stale because it has been open for 90 days with no activity.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity,
lifecycle/staleis applied - After 30d of inactivity since
lifecycle/stalewas applied, the issue is closed
You can:
- Mark this issue or PR as fresh with
/remove-lifecycle rotten - Close this issue or PR with
/close
/lifecycle stale
This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.
/reopen /lifecycle frozen
The issue isn't resolved upstream
@dprotaso: Reopened this issue.
In response to this:
/reopen /lifecycle frozen
The issue isn't resolved upstream
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.