karmada icon indicating copy to clipboard operation
karmada copied to clipboard

Fix operator upgrade issue

Open zhzhuang-zju opened this issue 5 months ago • 16 comments

What type of PR is this? /kind bug

What this PR does / why we need it: In PR #6180, the podSelectors for Karmada components such as etcd have been modified. Since the selector field of Deployments/StatefulSets is immutable, when the karmada-operator is upgraded, it attempts to modify this field while reconciling existing Karmada Custom Resources (CRs). This causes the reconciliation process to fail.

I0624 06:45:40.221750       1 controller.go:70] "Finished syncing karmada" karmada="test/karmada-demo" duration="69.143972ms"
E0624 06:45:40.221821       1 controller.go:347] "Reconciler error" err="failed to install etcd component, err: error when creating Etcd statefulset, err: StatefulSet.apps \"karmada-demo-etcd\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden" controller="karmada-operator-controller" controllerGroup="operator.karmada.io" controllerKind="Karmada" Karmada="test/karmada-demo" namespace="test" name="karmada-demo" reconcileID="80c957ab-16fe-4d2f-b1c3-febe4e8e31dc"

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer: How to reproduce it

  1. Install karmada-operator with imagetag v1.13.0
  2. Deploy a Karmada instance
kubectl apply -f - <<EOF
apiVersion: operator.karmada.io/v1alpha1
kind: Karmada
metadata:
  name: karmada-demo
  namespace: test
EOF
  1. Upgrade karmada-operator to v1.14.0
kubectl set image --namespace karmada-system deployments/karmada-operator karmada-operator=docker.io/karmada/karmada-operator:v1.14.0

Does this PR introduce a user-facing change?:

`karmada-operator`: Fix the issue where upgrading the Karmada Operator causes Karmada instances to fail

zhzhuang-zju avatar Jun 24 '25 06:06 zhzhuang-zju

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from zhzhuang-zju. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

karmada-bot avatar Jun 24 '25 06:06 karmada-bot

cc @jabellard

zhzhuang-zju avatar Jun 24 '25 06:06 zhzhuang-zju

FYI: this fix needs to be backported to release-1.14

zhzhuang-zju avatar Jun 24 '25 07:06 zhzhuang-zju

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

:x: Patch coverage is 0% with 12 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 48.97%. Comparing base (b20715e) to head (7be84f6). :warning: Report is 424 commits behind head on master.

Files with missing lines Patch % Lines
operator/pkg/util/apiclient/idempotency.go 0.00% 12 Missing :warning:
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6483      +/-   ##
==========================================
- Coverage   48.99%   48.97%   -0.02%     
==========================================
  Files         687      687              
  Lines       56090    56101      +11     
==========================================
- Hits        27479    27477       -2     
- Misses      26831    26843      +12     
- Partials     1780     1781       +1     
Flag Coverage Δ
unittests 48.97% <0.00%> (-0.02%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Jun 24 '25 07:06 codecov-commenter

/assign

jabellard avatar Jun 24 '25 13:06 jabellard

@zhzhuang-zju , thanks for this work. Generally looks good to me. Can you confirm we can still run multiple Karmada instances in the same namespace without any issues?

jabellard avatar Jun 25 '25 00:06 jabellard

@zhzhuang-zju , thanks for this work. Generally looks good to me. Can you confirm we can still run multiple Karmada instances in the same namespace without any issues?

Sure, I will do some tests and show you the test results

zhzhuang-zju avatar Jun 25 '25 01:06 zhzhuang-zju

$ kubectl get karmadas.operator.karmada.io --namespace test
NAME            READY   AGE
karmada-demo    True    3h43m
karmada-demo1   True    3h32m
karmada-demo2   True    3h4m
$ kubectl get pods --namespace test
karmada-demo-aggregated-apiserver-7899df7588-dsq9l       1/1     Running   0               3h28m
karmada-demo-apiserver-7df8997c74-dhrcs                  1/1     Running   0               3h28m
karmada-demo-controller-manager-559fbc945c-gmpjf         1/1     Running   0               3h28m
karmada-demo-etcd-0                                      1/1     Running   0               3h28m
karmada-demo-kube-controller-manager-d647b787-5wnmn      1/1     Running   0               3h28m
karmada-demo-metrics-adapter-7b5677db75-8x5jn            1/1     Running   0               3h28m
karmada-demo-metrics-adapter-7b5677db75-vcc5k            1/1     Running   0               3h28m
karmada-demo-scheduler-c7dd6944-lgb4z                    1/1     Running   0               3h28m
karmada-demo-webhook-5d86dcc959-wj4kw                    1/1     Running   0               3h28m
karmada-demo1-aggregated-apiserver-69c659b64d-nv5nt      1/1     Running   2 (3h24m ago)   3h24m
karmada-demo1-apiserver-5859c9fcd-c66w5                  1/1     Running   0               3h24m
karmada-demo1-controller-manager-75fbfdfdb9-d67pg        1/1     Running   0               3h24m
karmada-demo1-etcd-0                                     1/1     Running   0               3h24m
karmada-demo1-kube-controller-manager-7c894c75c8-r99b2   1/1     Running   0               3h24m
karmada-demo1-metrics-adapter-7cd9695769-bb9j4           1/1     Running   0               3h24m
karmada-demo1-metrics-adapter-7cd9695769-j45tp           1/1     Running   0               3h24m
karmada-demo1-scheduler-774c45f99d-xx9jz                 1/1     Running   0               3h24m
karmada-demo1-webhook-694757d499-m888r                   1/1     Running   0               3h24m
karmada-demo2-aggregated-apiserver-84775d48c7-zvvzc      1/1     Running   2 (176m ago)    176m
karmada-demo2-apiserver-774974ffcd-spvkm                 1/1     Running   0               176m
karmada-demo2-controller-manager-5b4fd69c89-k27rc        1/1     Running   0               176m
karmada-demo2-etcd-0                                     1/1     Running   0               176m
karmada-demo2-kube-controller-manager-5b9b4fd5f6-x2j7s   1/1     Running   0               176m
karmada-demo2-metrics-adapter-7cfcd45b88-dwwp8           1/1     Running   0               176m
karmada-demo2-metrics-adapter-7cfcd45b88-kwf9d           1/1     Running   0               176m
karmada-demo2-scheduler-579c695575-nt8dg                 1/1     Running   0               176m
karmada-demo2-webhook-6b44859c96-6j4bq                   1/1     Running   0               176m

Three Karmada instances have been deployed under the namespace test. Among them, the karmada-demo instance and the karmada-operator were upgraded from v1.13.0 together, while the other two instances were deployed after the karmada-operator was upgraded.

$ kubectl --kubeconfig karmada-apiserver.config get deployments.apps                                                  
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   0/2     0            0           170m
$ kubectl --kubeconfig karmada-apiserver1.config get deployments.apps
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
my-nginx   0/2     0            0           3h23m
$ kubectl --kubeconfig karmada-apiserver2.config get deployments.apps
No resources found in default namespace.

@jabellard

zhzhuang-zju avatar Jun 25 '25 06:06 zhzhuang-zju

When validating locally, I found a strange phenomenon. During the installation process, the component aggregated-apiserver will restart several times. The log is as follows:

I0625 06:55:49.520045       1 options.go:114] karmada-aggregated-apiserver version: version.Info{GitVersion:"v1.14.0", GitCommit:"42dfade3feef1d53928e6227c3d2a25a75b1a2b5", GitShortCommit:"42dfade3f", GitTreeState:"clean", BuildDate:"2025-05-30T12:36:11Z", GoVersion:"go1.23.8", Compiler:"gc", Platform:"linux/amd64"}
E0625 06:55:50.960763       1 run.go:72] "command failed" err="unable to load configmap based request-header-client-ca-file: Get \"https://172.18.0.6:32266/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication\": dial tcp 172.18.0.6:32266: connect: connection refused"

However, the same situation occurred when I used the karmada-operator of v1.14.0.

@jabellard Have you encountered this situation before?

Updated: I found the reason why karmada-aggregated-apiserver will restart during the installation process, and I'll fix it in another pr

Ref to #6492

zhzhuang-zju avatar Jun 25 '25 06:06 zhzhuang-zju

Three Karmada instances have been deployed under the namespace test. Among them, the karmada-demo instance and the karmada-operator were upgraded from v1.13.0 together, while the other two instances were deployed after the karmada-operator was upgraded.

@zhzhuang-zju , thanks for confirming.

jabellard avatar Jun 26 '25 13:06 jabellard

When validating locally, I found a strange phenomenon. During the installation process, the component aggregated-apiserver will restart several times. The log is as follows:

I0625 06:55:49.520045       1 options.go:114] karmada-aggregated-apiserver version: version.Info{GitVersion:"v1.14.0", GitCommit:"42dfade3feef1d53928e6227c3d2a25a75b1a2b5", GitShortCommit:"42dfade3f", GitTreeState:"clean", BuildDate:"2025-05-30T12:36:11Z", GoVersion:"go1.23.8", Compiler:"gc", Platform:"linux/amd64"}
E0625 06:55:50.960763       1 run.go:72] "command failed" err="unable to load configmap based request-header-client-ca-file: Get \"https://172.18.0.6:32266/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication\": dial tcp 172.18.0.6:32266: connect: connection refused"

However, the same situation occurred when I used the karmada-operator of v1.14.0.

@jabellard Have you encountered this situation before?

Updated: I found the reason why karmada-aggregated-apiserver will restart during the installation process, and I'll fix it in another pr

Ref to #6492

Thanks. I'm going to take a look.

jabellard avatar Jun 26 '25 13:06 jabellard

/lgtm

jabellard avatar Jun 26 '25 13:06 jabellard

@RainbowMango , would you like to take another look?

jabellard avatar Jun 26 '25 13:06 jabellard

@zhzhuang-zju Sorry for missing this. It seems we haven't reached an agreement about it, right? Based on discussion at https://github.com/karmada-io/karmada/pull/6483#discussion_r2174582152.

RainbowMango avatar Dec 04 '25 07:12 RainbowMango

It seems we haven't reached an agreement about it, right? Based on discussion at https://github.com/karmada-io/karmada/pull/6483#discussion_r2174582152.

Yes, the main disagreement lies in how to handle this upgrade-incompatible change. Broadly speaking, we have the following options:

  • Option 1: Keep the selector unchanged during upgrade (i.e., as implemented in this PR).
    Pros: Users are not impacted.
    Cons: The deprecated label remains in place.

  • Option 2: During upgrade, if the failure is due to an immutable field update, delete the old workload first and then create a new one. (automatically)

  • Option 3: When an upgrade fails, let the user manually recreate the workload. (manually)

Currently, I’m leaning toward Option 3. In this case, when an upgrade fails, Karmada CR’s conditions correctly reflect the failure reason.

  conditions:
  - lastTransitionTime: "2025-12-08T03:09:18Z"
    message: 'failed to install etcd component, err: error when creating Etcd statefulset,
      err: StatefulSet.apps "karmada-demo-etcd" is invalid: spec: Forbidden: updates
      to statefulset spec for fields other than ''replicas'', ''ordinals'', ''template'',
      ''updateStrategy'', ''revisionHistoryLimit'', ''persistentVolumeClaimRetentionPolicy''
      and ''minReadySeconds'' are forbidden'
    reason: Failed
    status: "False"
    type: Ready

However, we should update our release notes to treat this as an urgent upgrade requirement, not merely a bug fix.

zhzhuang-zju avatar Dec 08 '25 03:12 zhzhuang-zju

However, we should update our release notes to treat this as an urgent upgrade requirement, not merely a bug fix.

For example:

When upgrading `karmada-operator` from v1.13.0 to v1.14.0, the label selector used by the operator for Karmada control 
plane components has changed—from `karmada-app` to the standard labels `app.kubernetes.io/instance` and `app.kubernetes.io/name`.

Since Kubernetes does not allow updating the `spec.selector` of existing resources, please manually delete the existing 
Karmada control plane components (such as `karmada-apiserver`, `etcd`, etc.) after the upgrade. The operator will automatically 
recreate them with the new label selectors.

zhzhuang-zju avatar Dec 09 '25 11:12 zhzhuang-zju