
Upgrade to 1.6.1 prevents applying of Serving and Eventing CRs using server-side apply

tshak opened this issue 3 years ago · 16 comments

After upgrading to Operator 1.6.1, some Serving and Eventing CRs no longer apply using server-side apply. This appears to happen on clusters that at one point had pre-v1beta1 CRs.

Repro Steps

  1. Install Operator 1.2.2
  2. Apply a Serving or Eventing CR at v1alpha1 using server-side apply (e.g. kubectl apply --server-side=true)
  3. Upgrade to Operator version 1.5.3 (this also reproduces when upgrading through 1.3.2 and 1.4.1 as intermediate steps)
  4. Update CRs to use v1beta1 using server-side apply
  5. Upgrade to Operator version 1.6.1
  6. Perform a server-side apply on any CR (no changes necessary)

Expected

Server-side apply succeeds.

Actual

Server-side apply fails with the following error:

Error from server: request to convert CR to an invalid group/version: operator.knative.dev/v1alpha1

Additional Information

This was reproduced on live clusters that have used the Operator since at least release v1.0. It appears to be due to a lingering metadata.managedFields entry that references operator.knative.dev/v1alpha1. You can find a script that easily reproduces this issue in this repo.
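A lingering entry of this kind looks roughly like the following (illustrative YAML: the manager names and the second entry are assumptions, but the stale apiVersion matches the error above):

```yaml
metadata:
  managedFields:
    - manager: kubectl
      operation: Apply
      apiVersion: operator.knative.dev/v1alpha1  # stale group/version that the server can no longer convert
      fieldsType: FieldsV1
    - manager: knative-operator
      operation: Update
      apiVersion: operator.knative.dev/v1beta1
      fieldsType: FieldsV1
```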

tshak avatar Aug 17 '22 17:08 tshak

https://github.com/tshak/knoperator-v1beta1-ssa-repro/pull/1 This is the change I requested: upgrade the operator incrementally, one minor version at a time.

Allow more time, e.g. 30s, between each installation and before applying the v1beta1 CR. @tshak

I tested this with docker-desktop, and it worked.

Regarding the managedFields, I am not sure whether waiting longer before applying the v1beta1 CR will resolve your issue.

houshengbo avatar Aug 17 '22 20:08 houshengbo

I do not think it's a timing issue: our automation in our live cluster retries applying the CR and was still failing many hours after the Operator upgrade to v1.6.1. For the synthetic repro, I ran your PR and I'm still seeing the issue in a brand-new Kind cluster.

tshak avatar Aug 17 '22 21:08 tshak

@evankanderson @dprotaso

Here is what we can possibly do with the migrateresource function in knative/pkg. Currently we do

_, err := client.Namespace(item.GetNamespace()).
			Patch(ctx, item.GetName(), types.MergePatchType, []byte("{}"), metav1.PatchOptions{})

to migrate the existing CRs from the old version to the new one. Could we change it to:

_, err := client.Namespace(item.GetNamespace()).
			Patch(ctx, item.GetName(), types.MergePatchType, []byte(`{"metadata":{"managedFields": [{}]}}`), metav1.PatchOptions{})

so that the managedFields are cleared? That should resolve the issue described here, where a managedFields entry referencing the old version blocks installation of the CR at the new version.

houshengbo avatar Aug 18 '22 17:08 houshengbo

This should solve the issue. I'm not sure whether there are any unintended consequences of deleting all managedFields entries, though; the best solution is probably to remove only the operator.knative.dev/v1alpha1 entry. I'm not strongly opinionated either way.
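The targeted variant could look something like the sketch below (Python for illustration only; the operator itself is written in Go, and the function name here is hypothetical). The idea is to keep every managedFields entry except those that reference the stale group/version:

```python
# Sketch, not the operator's actual code: drop only the managedFields
# entries that reference a stale apiVersion, leaving all other field
# ownership information intact.

STALE_API_VERSION = "operator.knative.dev/v1alpha1"

def drop_stale_managed_fields(metadata: dict, stale_api_version: str) -> dict:
    """Return a copy of metadata without entries for stale_api_version."""
    entries = metadata.get("managedFields", [])
    kept = [e for e in entries if e.get("apiVersion") != stale_api_version]
    result = dict(metadata)
    result["managedFields"] = kept
    return result

# Example shaped like the entries reported in this issue:
metadata = {
    "name": "knative-serving",
    "managedFields": [
        {"manager": "kubectl", "apiVersion": "operator.knative.dev/v1alpha1"},
        {"manager": "kubectl", "apiVersion": "operator.knative.dev/v1beta1"},
    ],
}
cleaned = drop_stale_managed_fields(metadata, STALE_API_VERSION)
print([e["apiVersion"] for e in cleaned["managedFields"]])
# → ['operator.knative.dev/v1beta1']
```

The filtered list would then be written back with the same empty merge patch mechanism shown above, except that the patch body carries the surviving entries instead of `[{}]`.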

tshak avatar Aug 18 '22 17:08 tshak

My guess is the post-install job isn't being run after installing 1.5.3

https://github.com/knative/operator/blob/knative-v1.5.3/config/post-install/storage-version-migrator.yaml

This should migrate the CRD to a new storage version

dprotaso avatar Aug 18 '22 17:08 dprotaso

@dprotaso I actually moved that implementation into the operator controller itself in 1.5.3, so we no longer need to run the migration job; the operator does it automatically when it launches, which is equivalent to running the post-install job. What I am proposing here is to clear the managedFields entirely, or to remove only operator.knative.dev/v1alpha1 as @tshak suggested.

houshengbo avatar Aug 18 '22 19:08 houshengbo

It is correct that we did not run the post-install job. My contention is that requiring a manual step goes against the biggest value proposition of the Operator pattern: automating resource lifecycle issues such as this one.

tshak avatar Aug 18 '22 20:08 tshak

This seems like a k8s bug - let me see if I can repro with a trivial example

dprotaso avatar Aug 18 '22 21:08 dprotaso

I agree that it's probably a k8s bug. If a managedFields entry has an invalid apiVersion, it should not prevent a server-side apply. There may be another bug in which API conversion fails to update the managedFields entry in the first place. Regardless, I still think the Operator should perform one of the workarounds proposed by @houshengbo.

tshak avatar Aug 19 '22 06:08 tshak

@tshak Another way, without touching knative/pkg, could be to have the Knative operator access only the Serving and Eventing CRs and change the managedFields on both of them, if they exist.

houshengbo avatar Aug 19 '22 09:08 houshengbo

Created the upstream issue based on this Slack discussion:

https://github.com/kubernetes/kubernetes/issues/111937

dprotaso avatar Aug 19 '22 16:08 dprotaso

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Nov 18 '22 01:11 github-actions[bot]

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

knative-prow-robot avatar Dec 18 '22 01:12 knative-prow-robot

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Mar 21 '23 01:03 github-actions[bot]

/reopen
/lifecycle frozen

The issue isn't resolved upstream

dprotaso avatar Apr 20 '23 08:04 dprotaso

@dprotaso: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

The issue isn't resolved upstream

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] avatar Apr 20 '23 08:04 knative-prow[bot]