
Upgrade to kubecf-v2.7.1: scheduler pods disappeared.

ShuangMen opened this issue on Jan 07 '21 · 9 comments

Describe the bug

Step 1. kubecf v2.6.1 is installed in Kubernetes (version 1.19.5_1529).

Step 2. Run helm upgrade to upgrade kubecf to v2.7.1, with 'cc_deployment_updater: false' set in the chart file 'config/jobs.yaml'.
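For reference, the upgrade invocation was roughly of the following shape (a sketch only; the release name, namespace, and local chart path are assumptions, since the flag was set directly in the chart's config/jobs.yaml):

# Sketch of step 2: upgrade the existing release to the v2.7.1 chart.
# Release name, namespace, and chart location are assumed, not taken
# from the original environment.
helm upgrade kubecf ./kubecf-v2.7.1 --namespace kubecf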

After the upgrade, check the StatefulSets:

$ k get statefulset -n kubecf
NAME                     READY   AGE
api-z0                   1/1     163m
api-z1                   1/1     163m
auctioneer               1/1     164m
cc-worker-z0             1/1     164m
cc-worker-z1             1/1     164m
database                 1/1     165m
diego-api-z0             1/1     164m
diego-api-z1             1/1     164m
diego-cell-z0            1/1     164m
diego-cell-z1            1/1     164m
doppler-z0               1/1     164m
doppler-z1               1/1     164m
log-api-z0               1/1     164m
log-api-z1               1/1     164m
log-cache                1/1     164m
nats-z0                  1/1     164m
nats-z1                  1/1     164m
router-z0                1/1     164m
router-z1                1/1     164m
scheduler-z0             1/1     164m
scheduler-z1             1/1     164m
singleton-blobstore-z0   1/1     164m
uaa-z0                   1/1     164m
uaa-z1                   1/1     164m

However, the cf-operator quarks-statefulset logs contain many errors for the StatefulSet updates; one of them is shown below.

2021-01-07T03:54:45.947Z        ERROR        statefulset-rollout-reconciler  statefulset/statefulset_rollout_reconciler.go:172       Error updating StatefulSet 'kubecf/scheduler-z0' with partition move: Conflict while updating stateful set: Operation cannot be fulfilled on statefulsets.apps "scheduler-z0": the object has been modified; please apply your changes to the latest version and try again
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset.(*ReconcileStatefulSetRollout).Reconcile
        /go/src/code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset/statefulset_rollout_reconciler.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until

It seems the StatefulSets were not upgraded successfully. (The conflict above is the standard Kubernetes optimistic-concurrency error: another client modified the object between the controller's read and write, so the update must be retried against the latest resourceVersion.)
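One way to confirm whether a StatefulSet actually rolled over (READY 1/1 alone does not show this) is to compare its current and update revisions:

# If currentRevision differs from updateRevision, the rollout has not
# completed for this StatefulSet:
kubectl get statefulset scheduler-z0 -n kubecf \
  -o jsonpath='{.status.currentRevision}{"\n"}{.status.updateRevision}{"\n"}'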

Step 3. Run helm upgrade again and check the pod status:

$ k get pod -n kubecf
NAME                                     READY   STATUS      RESTARTS   AGE
api-z0-0                                 17/17   Running     16         120m
api-z1-0                                 17/17   Running     16         120m
auctioneer-0                             6/6     Running     1          121m
cc-worker-z0-0                           6/6     Running     0          120m
cc-worker-z1-0                           6/6     Running     0          121m
cf-apps-dns-59f9f659f5-c4z5g             1/1     Running     0          125m
coredns-quarks-6db68476bd-4pjvh          1/1     Running     0          173m
coredns-quarks-6db68476bd-cdqtm          1/1     Running     0          173m
database-0                               2/2     Running     0          124m
database-seeder-10807f69f53c6e0f-hpxmr   0/2     Completed   0          46h
database-seeder-7a19efc54ebbb714-pqtbg   0/2     Completed   0          47d
database-seeder-c400c55f793c774c-5ltcb   0/2     Completed   0          3h38m
database-seeder-d49344d80353dd73-gmljj   0/2     Completed   0          47d
database-seeder-e421e917deb27f22-mps4r   0/2     Completed   0          173m
diego-api-z0-0                           9/9     Running     2          120m
diego-api-z1-0                           9/9     Running     2          120m
diego-cell-z0-0                          12/12   Running     9          120m
diego-cell-z1-0                          12/12   Running     9          120m
doppler-z0-0                             6/6     Running     0          121m
doppler-z1-0                             6/6     Running     0          121m
ian-nsenter-10.240.64.24                 0/1     Completed   0          7d
log-api-z0-0                             9/9     Running     0          120m
log-api-z1-0                             9/9     Running     0          120m
log-cache-0                              10/10   Running     0          121m
nats-z0-0                                7/7     Running     0          121m
nats-z1-0                                7/7     Running     0          121m
router-z0-0                              7/7     Running     0          121m
router-z1-0                              7/7     Running     0          121m
singleton-blobstore-z0-0                 8/8     Running     0          121m
uaa-z0-0                                 8/8     Running     0          121m
uaa-z1-0                                 8/8     Running     0          121m

no "scheduler" pods there.

$ k get statefulset -n kubecf
NAME                     READY   AGE
api-z0                   1/1     172m
api-z1                   1/1     172m
auctioneer               1/1     172m
cc-worker-z0             1/1     172m
cc-worker-z1             1/1     172m
database                 1/1     174m
diego-api-z0             1/1     172m
diego-api-z1             1/1     172m
diego-cell-z0            1/1     172m
diego-cell-z1            1/1     172m
doppler-z0               1/1     172m
doppler-z1               1/1     172m
log-api-z0               1/1     172m
log-api-z1               1/1     172m
log-cache                1/1     172m
nats-z0                  1/1     172m
nats-z1                  1/1     172m
router-z0                1/1     172m
router-z1                1/1     172m
scheduler-z0             0/1     172m   <<
scheduler-z1             0/1     172m   <<
singleton-blobstore-z0   1/1     173m
uaa-z0                   1/1     172m
uaa-z1                   1/1     172m

Running k describe statefulset scheduler-z0 -n kubecf shows the error events below:

Events:
  Type     Reason            Age                    From                          Message
  ----     ------            ----                   ----                          -------
  Normal   SuccessfulDelete  4m13s                  statefulset-controller        delete Pod scheduler-z0-0 in StatefulSet scheduler-z0 successful
  Normal   Predicates        3m56s (x17 over 173m)  statefulset-rollout-recorder  {"reconciliationObjectName":"scheduler-z0","reconciliationObjectKind":"StatefulSet","predicateObjectName":"scheduler-z0","predicateObjectKind":"StatefulSet","namespace":"kubecf","message":"Update predicate passed for 'kubecf/scheduler-z0' for statefulset rollout","type":"Predicates"}
  Warning  FailedCreate      65s (x16 over 3m54s)   statefulset-controller        create Pod scheduler-z0-0 in StatefulSet scheduler-z0 failed error: Pod "scheduler-z0-0" is invalid: spec.containers[0].image: Required value
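The invalid-pod error points at the pod template itself; listing each container's name and image makes the entry with the missing image easy to spot (a diagnostic suggestion, not part of the original report):

# Print "name<TAB>image" for every container in the pod template; the
# broken entry shows a name with an empty image:
kubectl get statefulset scheduler-z0 -n kubecf \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'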

Checking the log of cf-operator quarks-statefulset:

2021-01-07T06:02:27.755Z        ERROR        statefulset-rollout-reconciler  statefulset/statefulset_rollout_util.go:65      Pods scheduler-z0-0 belonging to StatefulSet not foundkubecf/scheduler-z0:Pod "scheduler-z0-0" not found
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset.getPodWithIndex
        /go/src/code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset/statefulset_rollout_util.go:65
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset.partitionPodIsReadyAndUpdated
        /go/src/code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset/statefulset_rollout_reconciler.go:254
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset.(*ReconcileStatefulSetRollout).Reconcile
        /go/src/code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset/statefulset_rollout_reconciler.go:130
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
        /go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
2021-01-07T06:02:27.755Z        DEBUG        statefulset-rollout-reconciler  statefulset/statefulset_rollout_reconciler.go:165       Statefulset rollout for 'kubecf/scheduler-z0' is in status Canary

To Reproduce

Run helm upgrade from kubecf v2.6.1 (on Kubernetes 1.19) to kubecf v2.7.1, then run helm upgrade again.

Expected behavior

The helm upgrade completes successfully, with all pods running.

Environment

  • CF Version: kubecf-v2.7.1

ShuangMen · Jan 07 '21

There are two issues in the upgrade process above.

  • After step 2, the StatefulSets were not upgraded successfully.

  • After step 3, the scheduler's StatefulSet was updated with an incorrect and incomplete "cc-deployment-updater-cc-deployment-updater" container definition, which causes the scheduler pods to fail.

...
      containers:
      - imagePullPolicy: IfNotPresent
        name: cc-deployment-updater-cc-deployment-updater
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
...
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:template:
          f:spec:
            f:containers:
              k:{"name":"cc-deployment-updater-cc-deployment-updater"}:
                .: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:resources: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
    manager: kubectl-patch
    operation: Update
    time: "2021-01-07T05:58:09Z"
...

With 'cc_deployment_updater: false', no "cc-deployment-updater-cc-deployment-updater" container should be created in the scheduler pod.

https://github.com/cloudfoundry-incubator/kubecf/blob/master/chart/hooks/pre-upgrade/remove-deployment-updater-readiness.sh needs to add a check before patching the scheduler's StatefulSet.
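A guard along these lines would skip the patch when the job's container is not present (a hypothetical sketch, not the change that was actually merged):

# Hypothetical guard: only patch the scheduler StatefulSet if a
# cc-deployment-updater container actually exists in its pod template.
# StatefulSet name and namespace are assumptions (HA deployments use
# scheduler-z0/scheduler-z1 instead).
if kubectl get statefulset scheduler -n kubecf \
     -o jsonpath='{.spec.template.spec.containers[*].name}' \
   | grep -q 'cc-deployment-updater'; then
  # ... existing kubectl patch from the hook script goes here ...
  echo "patching scheduler StatefulSet"
fi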

ShuangMen · Jan 07 '21

but from the cf-operator quarks-statefulset logs, there are many errors for statefulsets update

I cannot reproduce this issue. Are you sure you updated cf-operator before updating kubecf? kubecf 2.7.1 requires cf-operator 7.x, while 2.6.1 uses cf-operator 6.x.

Anyway, your patch to check that the cc_deployment_updater job exists before patching looks correct, so I've approved it, but it would be good to understand the root cause of your initial problem.
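(For anyone double-checking this: assuming cf-operator was installed as its own Helm release, the deployed operator version can be verified before upgrading kubecf with something like the following.)

# List the cf-operator release and its chart version; the namespace
# name is an assumption about a typical setup.
helm list -n cf-operator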

jandubois · Jan 07 '21

Thanks for your comments @jandubois. Yes, I did upgrade cf-operator to the required version before upgrading kubecf.

For the issue

but from the cf-operator quarks-statefulset logs, there are many errors for statefulsets update

I will do more testing and investigation.

ShuangMen · Jan 08 '21

While trying to reproduce the issue ("from the cf-operator quarks-statefulset logs, there are many errors for statefulsets update"), I installed cf-operator v6.1.17-0.gec409fd7 and then tried to install kubecf v2.6.1. No pods were created in kubecf; I checked the cf-operator log file and found the errors below.

2021-01-12T03:53:21.328Z	ERROR	quarks-statefulset-reconciler	quarksstatefulset/quarksstatefulset_reconciler.go:147	Could not create StatefulSet for QuarksStatefulSet 'kubecf/database': could not create or update StatefulSet 'database' for QuarksStatefulSet 'kubecf/database': Internal error occurred: failed calling webhook "mutate-statefulsets.quarks.cloudfoundry.org": Post "https://qsts-webhook.cf-operator.svc:443/mutate-statefulsets?timeout=30s": service "qsts-webhook" not found
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/quarksstatefulset.(*ReconcileQuarksStatefulSet).Reconcile
	/go/pkg/mod/code.cloudfoundry.org/quarks-statefulset@v…/pkg/kube/controllers/quarksstatefulset/quarksstatefulset_reconciler.go:147
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v…/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v…/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v…/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v…/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v…/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v…/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v…/pkg/util/wait/wait.go:90
2021-01-12T03:53:21.328Z	ERROR	controller	controller/controller.go:246	Reconciler error	{"controller": "quarks-statefulset-controller", "name": "database", "namespace": "kubecf", "error": "could not create StatefulSet for QuarksStatefulSet 'kubecf/database': could not create or update StatefulSet 'database' for QuarksStatefulSet 'kubecf/database': Internal error occurred: failed calling webhook \"mutate-statefulsets.quarks.cloudfoundry.org\": Post \"https://qsts-webhook.cf-operator.svc:443/mutate-statefulsets?timeout=30s\": service \"qsts-webhook\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v…/pkg/internal/controller/controller.go:246
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v…/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v…/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v…/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v…/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v…/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v…/pkg/util/wait/wait.go:90
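The service "qsts-webhook" not found error suggests the operator's mutating webhook Service is missing from the cf-operator namespace. A quick check (resource names taken from the error message; the namespace is assumed to be cf-operator):

# Verify the webhook Service and the mutating webhook configuration exist:
kubectl get svc qsts-webhook -n cf-operator
kubectl get mutatingwebhookconfigurations | grep -i quarks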

ShuangMen · Jan 12 '21

I think we're hitting the same issue with a fresh install of 2.7.1.

Post-deploy, the scheduler has 0/1 pods ready. Running kubectl describe on the StatefulSet reveals the following:

Warning  FailedCreate      12m (x40 over 138m)   statefulset-controller        create Pod scheduler-0 in StatefulSet scheduler failed error: Pod "scheduler-0" is invalid: spec.containers[0].image: Required value
Warning  FailedCreate      4m47s (x17 over 10m)  statefulset-controller        create Pod scheduler-0 in StatefulSet scheduler failed error: Pod "scheduler-0" is invalid: spec.containers[0].image: Required value

And indeed, running kubectl get statefulset scheduler -n kubecf -o yaml shows that the first element of the containers array is the following, which is missing an image field:

- imagePullPolicy: IfNotPresent
  name: cc-deployment-updater-cc-deployment-updater
  resources: {}
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File

will-gant · Jan 12 '21

As a follow-up on the above, we removed the bad cc-deployment-updater-cc-deployment-updater container, after which the scheduler StatefulSet deployed its Pod properly and things started working for us. Checking the ops-file ConfigMaps, I can see an operation for that job:

{
    "type": "remove",
    "path": "/instance_groups/name=scheduler/jobs/name=cc_deployment_updater"
}

So I would expect it to be removed, but it appears the operation removes just enough that cf-operator doesn't add any of the interesting bits like the image, yet not enough to actually drop the whole container. This is a non-HA deployment. (A sketch of the manual workaround follows.)
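For anyone applying the same workaround, a JSON patch can remove the broken container, assuming it is the first element of the containers array as in the output above (verify the index first):

# Remove the image-less container from the scheduler StatefulSet's pod
# template; index 0 is an assumption based on the YAML shown earlier.
kubectl patch statefulset scheduler -n kubecf --type=json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/0"}]'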

andy-paine · Jan 12 '21

For the cc_deployment_updater issue, the fix is merged: https://github.com/cloudfoundry-incubator/kubecf/pull/1676. It adds a check before patching the scheduler StatefulSet.

ShuangMen · Jan 13 '21

Ah right. It would be good if a new release with that fix could be cut because, as far as we can see, 2.7.1 doesn't actually work fully on a fresh install. Thanks for getting back to us @ShuangMen 🙂

andy-paine · Jan 13 '21

@andy-paine There should be a new release within the next few days.

jandubois · Jan 13 '21