After upgrading to kubecf-v2.7.1, the scheduler pods disappeared.
Describe the bug
Step 1. kubecf v2.6.1 is installed on Kubernetes (version 1.19.5_1529).
Step 2. Run helm upgrade for kubecf, trying to upgrade it to v2.7.1, with 'cc_deployment_updater: false' set in the file 'config/jobs.yaml'.
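A sketch of the upgrade invocation used in step 2 (the chart reference, release name, and values file name are assumptions for illustration, not taken verbatim from this report):
$ helm upgrade kubecf kubecf/kubecf \
    --namespace kubecf \
    --version 2.7.1 \
    --values custom-values.yaml   # includes the cc_deployment_updater: false override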
After the upgrade, check the StatefulSets:
$ k get statefulset -n kubecf
NAME READY AGE
api-z0 1/1 163m
api-z1 1/1 163m
auctioneer 1/1 164m
cc-worker-z0 1/1 164m
cc-worker-z1 1/1 164m
database 1/1 165m
diego-api-z0 1/1 164m
diego-api-z1 1/1 164m
diego-cell-z0 1/1 164m
diego-cell-z1 1/1 164m
doppler-z0 1/1 164m
doppler-z1 1/1 164m
log-api-z0 1/1 164m
log-api-z1 1/1 164m
log-cache 1/1 164m
nats-z0 1/1 164m
nats-z1 1/1 164m
router-z0 1/1 164m
router-z1 1/1 164m
scheduler-z0 1/1 164m
scheduler-z1 1/1 164m
singleton-blobstore-z0 1/1 164m
uaa-z0 1/1 164m
uaa-z1 1/1 164m
However, the cf-operator quarks-statefulset logs show many errors for the StatefulSet updates; one of them is shown below.
2021-01-07T03:54:45.947Z ERROR statefulset-rollout-reconciler statefulset/statefulset_rollout_reconciler.go:172 Error updating StatefulSet 'kubecf/scheduler-z0' with partition move: Conflict while updating stateful set: Operation cannot be fulfilled on statefulsets.apps "scheduler-z0": the object has been modified; please apply your changes to the latest version and try again
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset.(*ReconcileStatefulSetRollout).Reconcile
/go/src/code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset/statefulset_rollout_reconciler.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
It seems the StatefulSets were not upgraded successfully.
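The conflict above is the standard optimistic-concurrency error from the API server, meaning two writers raced on the same StatefulSet object. One way to check whether the rollout eventually converged (illustrative commands; quarksstatefulsets is the CRD installed by cf-operator):
$ kubectl get quarksstatefulsets -n kubecf
$ kubectl get statefulset scheduler-z0 -n kubecf \
    -o jsonpath='{.status.currentRevision}{"  "}{.status.updateRevision}{"\n"}'
# if the two revisions differ, the StatefulSet rollout has not finished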
Step 3. Run helm upgrade for kubecf again and check the pod status:
$ k get pod -n kubecf
NAME READY STATUS RESTARTS AGE
api-z0-0 17/17 Running 16 120m
api-z1-0 17/17 Running 16 120m
auctioneer-0 6/6 Running 1 121m
cc-worker-z0-0 6/6 Running 0 120m
cc-worker-z1-0 6/6 Running 0 121m
cf-apps-dns-59f9f659f5-c4z5g 1/1 Running 0 125m
coredns-quarks-6db68476bd-4pjvh 1/1 Running 0 173m
coredns-quarks-6db68476bd-cdqtm 1/1 Running 0 173m
database-0 2/2 Running 0 124m
database-seeder-10807f69f53c6e0f-hpxmr 0/2 Completed 0 46h
database-seeder-7a19efc54ebbb714-pqtbg 0/2 Completed 0 47d
database-seeder-c400c55f793c774c-5ltcb 0/2 Completed 0 3h38m
database-seeder-d49344d80353dd73-gmljj 0/2 Completed 0 47d
database-seeder-e421e917deb27f22-mps4r 0/2 Completed 0 173m
diego-api-z0-0 9/9 Running 2 120m
diego-api-z1-0 9/9 Running 2 120m
diego-cell-z0-0 12/12 Running 9 120m
diego-cell-z1-0 12/12 Running 9 120m
doppler-z0-0 6/6 Running 0 121m
doppler-z1-0 6/6 Running 0 121m
ian-nsenter-10.240.64.24 0/1 Completed 0 7d
log-api-z0-0 9/9 Running 0 120m
log-api-z1-0 9/9 Running 0 120m
log-cache-0 10/10 Running 0 121m
nats-z0-0 7/7 Running 0 121m
nats-z1-0 7/7 Running 0 121m
router-z0-0 7/7 Running 0 121m
router-z1-0 7/7 Running 0 121m
singleton-blobstore-z0-0 8/8 Running 0 121m
uaa-z0-0 8/8 Running 0 121m
uaa-z1-0 8/8 Running 0 121m
no "scheduler" pods there.
$ k get statefulset -n kubecf
NAME READY AGE
api-z0 1/1 172m
api-z1 1/1 172m
auctioneer 1/1 172m
cc-worker-z0 1/1 172m
cc-worker-z1 1/1 172m
database 1/1 174m
diego-api-z0 1/1 172m
diego-api-z1 1/1 172m
diego-cell-z0 1/1 172m
diego-cell-z1 1/1 172m
doppler-z0 1/1 172m
doppler-z1 1/1 172m
log-api-z0 1/1 172m
log-api-z1 1/1 172m
log-cache 1/1 172m
nats-z0 1/1 172m
nats-z1 1/1 172m
router-z0 1/1 172m
router-z1 1/1 172m
scheduler-z0 0/1 172m <<
scheduler-z1 0/1 172m <<
singleton-blobstore-z0 1/1 173m
uaa-z0 1/1 172m
uaa-z1 1/1 172m
$ k describe statefulset scheduler-z0 -n kubecf shows the error below:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulDelete 4m13s statefulset-controller delete Pod scheduler-z0-0 in StatefulSet scheduler-z0 successful
Normal Predicates 3m56s (x17 over 173m) statefulset-rollout-recorder {"reconciliationObjectName":"scheduler-z0","reconciliationObjectKind":"StatefulSet","predicateObjectName":"scheduler-z0","predicateObjectKind":"StatefulSet","namespace":"kubecf","message":"Update predicate passed for 'kubecf/scheduler-z0' for statefulset rollout","type":"Predicates"}
Warning FailedCreate 65s (x16 over 3m54s) statefulset-controller create Pod scheduler-z0-0 in StatefulSet scheduler-z0 failed error: Pod "scheduler-z0-0" is invalid: spec.containers[0].image: Required value
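The FailedCreate message points at the first container in the pod template missing its image. Dumping just that entry makes the broken spec easy to spot (command is illustrative):
$ kubectl get statefulset scheduler-z0 -n kubecf \
    -o jsonpath='{.spec.template.spec.containers[0]}'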
Check the log of cf-operator quarks-statefulset:
2021-01-07T06:02:27.755Z ERROR statefulset-rollout-reconciler statefulset/statefulset_rollout_util.go:65 Pods scheduler-z0-0 belonging to StatefulSet not foundkubecf/scheduler-z0:Pod "scheduler-z0-0" not found
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset.getPodWithIndex
/go/src/code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset/statefulset_rollout_util.go:65
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset.partitionPodIsReadyAndUpdated
/go/src/code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset/statefulset_rollout_reconciler.go:254
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset.(*ReconcileStatefulSetRollout).Reconcile
/go/src/code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/statefulset/statefulset_rollout_reconciler.go:130
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/src/code.cloudfoundry.org/quarks-statefulset/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
2021-01-07T06:02:27.755Z DEBUG statefulset-rollout-reconciler statefulset/statefulset_rollout_reconciler.go:165 Statefulset rollout for 'kubecf/scheduler-z0' is in status Canary
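The "Canary" status means the quarks-statefulset controller is still in the first phase of its rollout, waiting for the canary pod to become ready and updated. The rollout state is recorded in annotations on the StatefulSet (the exact annotation keys vary by operator version, so dumping them all avoids guessing):
$ kubectl get statefulset scheduler-z0 -n kubecf -o jsonpath='{.metadata.annotations}'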
To Reproduce
Run helm upgrade from kubecf v2.6.1 (on Kubernetes 1.19) to kubecf-v2.7.1, then run helm upgrade again.
Expected behavior
helm upgrade completes successfully with all pods running.
Environment
- CF Version: kubecf-v2.7.1
There are two issues found in the above upgrade process:
- After step 2, the StatefulSets are not upgraded successfully.
- After step 3, the scheduler's StatefulSet is updated with incorrect and incomplete information for the container "cc-deployment-updater-cc-deployment-updater", which causes the scheduler pods to fail:
...
containers:
- imagePullPolicy: IfNotPresent
  name: cc-deployment-updater-cc-deployment-updater
  resources: {}
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
...
- apiVersion: apps/v1
  fieldsType: FieldsV1
  fieldsV1:
    f:spec:
      f:template:
        f:spec:
          f:containers:
            k:{"name":"cc-deployment-updater-cc-deployment-updater"}:
              .: {}
              f:imagePullPolicy: {}
              f:name: {}
              f:resources: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
  manager: kubectl-patch
  operation: Update
  time: "2021-01-07T05:58:09Z"
...
With 'cc_deployment_updater: false', no "cc-deployment-updater-cc-deployment-updater" container should be created in the scheduler pod.
https://github.com/cloudfoundry-incubator/kubecf/blob/master/chart/hooks/pre-upgrade/remove-deployment-updater-readiness.sh needs to add a check before patching the scheduler's StatefulSet.
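A guard along the following lines in that pre-upgrade hook would skip the patch when the job is disabled (a hypothetical sketch only; the variable names, the StatefulSet name, and the patch arguments are placeholders, not the real script's contents):
if kubectl get statefulset scheduler -n "${KUBECF_NAMESPACE}" \
     -o jsonpath='{.spec.template.spec.containers[*].name}' 2>/dev/null \
     | grep -q 'cc-deployment-updater'; then
  # only patch when the cc-deployment-updater container actually exists
  # (the StatefulSets are named scheduler-z0/scheduler-z1 in an HA deployment)
  kubectl patch statefulset scheduler -n "${KUBECF_NAMESPACE}" "${PATCH_ARGS[@]}"
fi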
Regarding "but from the cf-operator quarks-statefulset logs, there are many errors for statefulsets update":
I cannot reproduce this issue; are you sure you updated cf-operator before updating kubecf? 2.7.1 requires cf-operator 7.xx and 2.6.1 uses cf-operator 6.xx.
Anyway, your patch to check that the cc_deployment_updater job exists before patching looks correct, so I've approved it, but it would be good to understand the root cause of your initial problem.
Thanks for your comments @jandubois. Yes, I did upgrade cf-operator to the corresponding version before upgrading kubecf.
For the issue "but from the cf-operator quarks-statefulset logs, there are many errors for statefulsets update", I will do more tests and investigation.
When I tried to reproduce the issue "from the cf-operator quarks-statefulset logs, there are many errors for statefulsets update":
I installed cf-operator v6.1.17-0.gec409fd7 and then tried to install kubecf v2.6.1.
No pods were created in kubecf; I checked the cf-operator log and found the errors below.
2021-01-12T03:53:21.328Z ERROR quarks-statefulset-reconciler quarksstatefulset/quarksstatefulset_reconciler.go:147 Could not create StatefulSet for QuarksStatefulSet 'kubecf/database': could not create or update StatefulSet 'database' for QuarksStatefulSet 'kubecf/database': Internal error occurred: failed calling webhook "mutate-statefulsets.quarks.cloudfoundry.org": Post "https://qsts-webhook.cf-operator.svc:443/mutate-statefulsets?timeout=30s": service "qsts-webhook" not found
code.cloudfoundry.org/quarks-statefulset/pkg/kube/controllers/quarksstatefulset.(*ReconcileQuarksStatefulSet).Reconcile
/go/pkg/mod/code.cloudfoundry.org/[email protected]/pkg/kube/controllers/quarksstatefulset/quarksstatefulset_reconciler.go:147
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
2021-01-12T03:53:21.328Z ERROR controller controller/controller.go:246 Reconciler error {"controller": "quarks-statefulset-controller", "name": "database", "namespace": "kubecf", "error": "could not create StatefulSet for QuarksStatefulSet 'kubecf/database': could not create or update StatefulSet 'database' for QuarksStatefulSet 'kubecf/database': Internal error occurred: failed calling webhook \"mutate-statefulsets.quarks.cloudfoundry.org\": Post \"https://qsts-webhook.cf-operator.svc:443/mutate-statefulsets?timeout=30s\": service \"qsts-webhook\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:246
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
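The error indicates the quarks-statefulset mutating webhook registration still points at a qsts-webhook Service that does not exist. A couple of checks like the following (the cf-operator namespace name is an assumption based on the URL in the error) show whether the Service and the webhook configuration are out of sync:
$ kubectl get service qsts-webhook -n cf-operator
$ kubectl get mutatingwebhookconfigurations -o name | grep quarks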
I think we're hitting the same issue with a fresh install of 2.7.1.
Post-deploy, the scheduler has 0/1 pods ready. Running kubectl describe on the StatefulSet reveals the following:
Warning FailedCreate 12m (x40 over 138m) statefulset-controller create Pod scheduler-0 in StatefulSet scheduler failed error: Pod "scheduler-0" is invalid: spec.containers[0].image: Required value
Warning FailedCreate 4m47s (x17 over 10m) statefulset-controller create Pod scheduler-0 in StatefulSet scheduler failed error: Pod "scheduler-0" is invalid: spec.containers[0].image: Required value
And indeed, when running kubectl get statefulset scheduler -n kubecf -o yaml, we can see that the first element in the containers array is the following, which is missing an image field:
- imagePullPolicy: IfNotPresent
  name: cc-deployment-updater-cc-deployment-updater
  resources: {}
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
As a follow-up on the above, we removed the bad cc-deployment-updater-cc-deployment-updater container (see the patch sketch at the end of this comment) and the scheduler StatefulSet deployed its Pod properly, and things started working for us. Checking the ops-file ConfigMaps, I can see an operation for that job:
{
  "type": "remove",
  "path": "/instance_groups/name=scheduler/jobs/name=cc_deployment_updater"
}
So I would expect the job to be removed, but it appears the removal goes far enough that cf-operator stops adding the interesting bits like the image, yet not far enough to actually drop the whole container. This is a non-HA deployment.
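For anyone needing the same workaround, removing the broken container can be done with a JSON patch like the one below (a sketch, not necessarily the exact commands we ran; it assumes the broken entry is the first element of the containers array, so verify the index first):
$ kubectl get statefulset scheduler -n kubecf \
    -o jsonpath='{.spec.template.spec.containers[0].name}'
$ kubectl patch statefulset scheduler -n kubecf --type json \
    -p '[{"op": "remove", "path": "/spec/template/spec/containers/0"}]'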
For the cc_deployment_updater issue, the fix is merged: https://github.com/cloudfoundry-incubator/kubecf/pull/1676. It adds a check before patching the scheduler StatefulSet.
Ah right. It would be good if a new release with that fix could be cut because as far as we can see, 2.7.1 doesn't actually work fully on a fresh install. Thanks for getting back to us @ShuangMen 🙂
@andy-paine There should be a new release within the next few days.