[Flake] ScyllaCluster [It] should support scaling
https://github.com/scylladb/scylla-operator/runs/6098102031?check_suite_focus=true#step:12:828
2022-04-20T16:26:15.5863099Z Unexpected error:
2022-04-20T16:26:15.5875913Z <*errors.errorString | 0xc00029aea0>: {
2022-04-20T16:26:15.5907697Z s: "timed out waiting for the condition",
2022-04-20T16:26:15.5908547Z }
2022-04-20T16:26:15.5942141Z timed out waiting for the condition
2022-04-20T16:26:15.5942777Z occurred
2022-04-20T16:26:15.5988318Z In [It] at: github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_scaling.go:172
We've hit this flake on a PR branch (#973) introducing a conntrack-cleaner DaemonSet, so bear in mind it might have been caused by the removal of TCP conntrack entries, although at this point I'm quite confident it's unrelated.
Scylla Pod basic-q25mg-us-east-1-us-east-1a-1 fails with:
2022-04-20T16:24:50.128071219Z ERROR 2022-04-20 16:24:50,127 [shard 0] init - Startup failed: std::runtime_error (This node was decommissioned and will not rejoin the ring unless override_decommission=true has been set,or all existing data is removed and the node is bootstrapped again)
What we've found so far (cc @zimnx):
An error logged by the operator during scaledown:
E0420 15:59:23.721881 1 scyllacluster/status.go:112] only last member of each rack should be decommissioning, but 2-th member of us-east-1a found decommissioning while rack had 2 members
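For orientation, a hedged reconstruction of what that check amounts to (the function below is illustrative, not the operator's actual status.go code); it shows why an ordinal-2 member still decommissioning in a rack whose member count is already 2 trips the error:

package sketch

import "fmt"

// validateDecommission is a hypothetical stand-in for the operator's sanity
// check: during a scale-down only the highest-ordinal member of a rack is
// expected to carry the decommission label.
func validateDecommission(rackName string, rackMembers int, decommissioningOrdinals []int) error {
	for _, ordinal := range decommissioningOrdinals {
		if ordinal != rackMembers-1 {
			return fmt.Errorf("only last member of each rack should be decommissioning, but %d-th member of %s found decommissioning while rack had %d members", ordinal, rackName, rackMembers)
		}
	}
	return nil
}

Calling validateDecommission("us-east-1a", 2, []int{2}) reproduces the message above, i.e. the member count presumably already reflects the downscaled spec while pod 2 is still decommissioning.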
Audit log (Pod and PVC create/patch/delete events in the test namespace):
jq -C 'select(.requestObject != null and .objectRef.namespace == "e2e-test-scyllacluster-frb8h-c5jdt" and (.objectRef.resource == "pods" or .objectRef.resource == "persistentvolumeclaims") and (.verb == "delete" or .verb == "create" or .verb == "patch")) | select(.responseObject | ( .metadata.namespace=="e2e-test-scyllacluster-frb8h-c5jdt") ) | "\(.requestReceivedTimestamp) \(.responseObject.kind) \(.responseObject.metadata.name) \(.verb)))"' kube-apiserver-audit.log |& less -R
"2022-04-20T15:49:00.665289Z PersistentVolumeClaim data-basic-q25mg-us-east-1-us-east-1a-0 create))"
"2022-04-20T15:49:00.674049Z Pod basic-q25mg-us-east-1-us-east-1a-0 create))"
"2022-04-20T15:49:00.712274Z Pod basic-q25mg-us-east-1-us-east-1a-0 patch))"
"2022-04-20T15:49:04.227469Z Pod basic-q25mg-us-east-1-us-east-1a-0 patch))"
"2022-04-20T15:49:12.850698Z Pod basic-q25mg-us-east-1-us-east-1a-0 patch))"
"2022-04-20T15:49:30.428492Z Pod basic-q25mg-us-east-1-us-east-1a-0 patch))"
"2022-04-20T15:49:33.028337Z Pod basic-q25mg-us-east-1-us-east-1a-0 patch))"
"2022-04-20T15:49:53.831287Z Pod basic-q25mg-us-east-1-us-east-1a-0 patch))"
"2022-04-20T15:50:22.703295Z Pod basic-q25mg-us-east-1-us-east-1a-0 patch))"
"2022-04-20T15:50:55.930098Z PersistentVolumeClaim data-basic-q25mg-us-east-1-us-east-1a-1 create))"
"2022-04-20T15:50:56.013056Z Pod basic-q25mg-us-east-1-us-east-1a-1 create))"
"2022-04-20T15:50:56.039772Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:52:39.141011Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:52:50.543184Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:53:01.143736Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:53:02.743794Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:53:36.369526Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:55:26.273638Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:55:26.307124Z PersistentVolumeClaim data-basic-q25mg-us-east-1-us-east-1a-2 create))"
"2022-04-20T15:55:26.329697Z Pod basic-q25mg-us-east-1-us-east-1a-2 create))"
"2022-04-20T15:55:26.347700Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:55:36.302299Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:55:41.084628Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:55:53.729845Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:55:57.338199Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:56:16.515293Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:57:56.305048Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:58:26.845648Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:59:23.649979Z Pod basic-q25mg-us-east-1-us-east-1a-2 delete))"
"2022-04-20T15:59:26.549812Z Pod basic-q25mg-us-east-1-us-east-1a-2 patch))"
"2022-04-20T15:59:26.592306Z Pod basic-q25mg-us-east-1-us-east-1a-2 delete))"
"2022-04-20T15:59:33.736194Z PersistentVolumeClaim data-basic-q25mg-us-east-1-us-east-1a-2 delete))"
"2022-04-20T15:59:43.942403Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:59:44.107274Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:59:45.861643Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:59:46.277592Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T15:59:50.074942Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:01:28.283630Z Pod basic-q25mg-us-east-1-us-east-1a-1 delete))"
"2022-04-20T16:01:30.573241Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:01:30.609214Z Pod basic-q25mg-us-east-1-us-east-1a-1 delete))"
"2022-04-20T16:01:38.655117Z PersistentVolumeClaim data-basic-q25mg-us-east-1-us-east-1a-1 delete))"
"2022-04-20T16:01:49.041244Z PersistentVolumeClaim data-basic-q25mg-us-east-1-us-east-1a-1 create))"
"2022-04-20T16:01:49.065119Z Pod basic-q25mg-us-east-1-us-east-1a-1 create))"
"2022-04-20T16:01:49.105351Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:01:50.435651Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:01:52.506772Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:02:04.152209Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:02:05.219779Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:02:10.444144Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:25:03.036427Z Pod basic-q25mg-us-east-1-us-east-1a-0 patch))"
"2022-04-20T16:25:03.288236Z Pod basic-q25mg-us-east-1-us-east-1a-0 delete))"
"2022-04-20T16:25:07.127599Z Pod basic-q25mg-us-east-1-us-east-1a-1 patch))"
"2022-04-20T16:25:07.431384Z Pod basic-q25mg-us-east-1-us-east-1a-1 delete))"
storage-provisioner.yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"Reconcile","integration-test":"storage-provisioner"},"name":"storage-provisioner","namespace":"kube-system"},"spec":{"containers":[{"command":["/storage-provisioner"],"image":"gcr.io/k8s-minikube/storage-provisioner:v5","imagePullPolicy":"IfNotPresent","name":"storage-provisioner","volumeMounts":[{"mountPath":"/tmp","name":"tmp"}]}],"hostNetwork":true,"serviceAccountName":"storage-provisioner","volumes":[{"hostPath":{"path":"/tmp","type":"Directory"},"name":"tmp"}]}}
creationTimestamp: "2022-04-20T15:45:27Z"
labels:
addonmanager.kubernetes.io/mode: Reconcile
integration-test: storage-provisioner
name: storage-provisioner
namespace: kube-system
resourceVersion: "5463"
uid: c58226ec-8255-484a-b0a4-edfc30ed2d26
spec:
containers:
- command:
- /storage-provisioner
image: gcr.io/k8s-minikube/storage-provisioner:v5
imagePullPolicy: IfNotPresent
name: storage-provisioner
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /tmp
name: tmp
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-2qf8x
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostNetwork: true
nodeName: fv-az180-796
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: storage-provisioner
serviceAccountName: storage-provisioner
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- hostPath:
path: /tmp
type: Directory
name: tmp
- name: kube-api-access-2qf8x
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-04-20T15:45:27Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2022-04-20T15:55:15Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2022-04-20T15:55:15Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2022-04-20T15:45:27Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://0912254df14bfb68bce9892c4d12fddb741cf957ed6dcb1df25b665ad2762e25
image: gcr.io/k8s-minikube/storage-provisioner:v5
imageID: docker-pullable://gcr.io/k8s-minikube/storage-provisioner@sha256:18eb69d1418e854ad5a19e399310e52808a8321e4c441c1dddad8977a0d7a944
lastState:
terminated:
containerID: docker://325016c30b91389a7f4a2f439ee367490cac7763ca11355268ed41cfb90bb8a3
exitCode: 255
finishedAt: "2022-04-20T15:53:49Z"
reason: Error
startedAt: "2022-04-20T15:52:05Z"
name: storage-provisioner
ready: true
restartCount: 5
started: true
state:
running:
startedAt: "2022-04-20T15:55:13Z"
hostIP: 10.1.0.135
phase: Running
podIP: 10.1.0.135
podIPs:
- ip: 10.1.0.135
qosClass: BestEffort
startTime: "2022-04-20T15:45:27Z"
storage-provisioner.logs
I0420 16:01:38.875727 1 controller.go:1478] delete "pvc-b11deda4-b78b-4ecc-ad84-ce7ccd6975a0": volume deletion ignored: ignored because identity annotation on PV does not match ours
It seems that after the data-basic-q25mg-us-east-1-us-east-1a-1 PVC was deleted, storage-provisioner didn't delete the PV, and hence the backing storage wasn't cleared. After scaling back up, data-basic-q25mg-us-east-1-us-east-1a-1 was successfully provisioned again.
Possibly related: Deleted hostpath PVs stuck in released state after minikube restart #4546
If you look at the audit logs, can you dump the PV to see why the provisioner thinks differently? https://github.com/kubernetes/minikube/blob/f567893cdbb8e98a0348c1d93455af71cf6ce12c/pkg/storage/storage_provisioner.go#L97-L112
yq '.items.[] | select(.spec.claimRef | (.namespace == "e2e-test-scyllacluster-frb8h-c5jdt" )) | [.]' persistentvolumes.yaml
- apiVersion: v1
kind: PersistentVolume
metadata:
annotations:
hostPathProvisionerIdentity: 32a4dd97-e96a-4482-8dc2-8d12dbecde78
pv.kubernetes.io/provisioned-by: k8s.io/minikube-hostpath
creationTimestamp: "2022-04-20T15:52:25Z"
finalizers:
- kubernetes.io/pv-protection
name: pvc-b11deda4-b78b-4ecc-ad84-ce7ccd6975a0
resourceVersion: "8188"
uid: 174d7932-73cb-4be7-82ea-c080d8441768
spec:
accessModes:
- ReadWriteOnce
capacity:
storage: 100Mi
claimRef:
apiVersion: v1
kind: PersistentVolumeClaim
name: data-basic-q25mg-us-east-1-us-east-1a-1
namespace: e2e-test-scyllacluster-frb8h-c5jdt
resourceVersion: "3197"
uid: b11deda4-b78b-4ecc-ad84-ce7ccd6975a0
hostPath:
path: /tmp/hostpath-provisioner/e2e-test-scyllacluster-frb8h-c5jdt/data-basic-q25mg-us-east-1-us-east-1a-1
type: ""
persistentVolumeReclaimPolicy: Delete
storageClassName: standard
volumeMode: Filesystem
status:
phase: Released
- apiVersion: v1
kind: PersistentVolume
metadata:
annotations:
hostPathProvisionerIdentity: 69c3750d-ffae-41a3-99bf-94ec19737ca7
pv.kubernetes.io/provisioned-by: k8s.io/minikube-hostpath
creationTimestamp: "2022-04-20T15:49:00Z"
finalizers:
- kubernetes.io/pv-protection
name: pvc-e18bc16e-4a9c-4c55-8755-7212a3a2fa15
resourceVersion: "12566"
uid: 13ce2d8e-2943-46c1-8f36-d9eecd8cb075
spec:
accessModes:
- ReadWriteOnce
capacity:
storage: 100Mi
claimRef:
apiVersion: v1
kind: PersistentVolumeClaim
name: data-basic-q25mg-us-east-1-us-east-1a-0
namespace: e2e-test-scyllacluster-frb8h-c5jdt
resourceVersion: "1482"
uid: e18bc16e-4a9c-4c55-8755-7212a3a2fa15
hostPath:
path: /tmp/hostpath-provisioner/e2e-test-scyllacluster-frb8h-c5jdt/data-basic-q25mg-us-east-1-us-east-1a-0
type: ""
persistentVolumeReclaimPolicy: Delete
storageClassName: standard
volumeMode: Filesystem
status:
phase: Released
Seems like the provisioner uses a random UUID that by definition won't work across a restart: https://github.com/kubernetes/minikube/blob/f567893cdbb8e98a0348c1d93455af71cf6ce12c/pkg/storage/storage_provisioner.go#L52. Looking at the logs, what was the cause of the very first restart of the controller?
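To make the identity mismatch concrete, here is a trimmed, paraphrased sketch of the linked code (pkg/storage/storage_provisioner.go); field names and the error text follow upstream, but this is not a verbatim copy and the actual removal of the hostPath directory is elided:

package sketch

import (
	"errors"

	core "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/uuid"
	"sigs.k8s.io/sig-storage-lib-external-provisioner/v6/controller"
)

type hostPathProvisioner struct {
	pvDir    string
	identity types.UID
}

func newHostPathProvisioner(pvDir string) *hostPathProvisioner {
	return &hostPathProvisioner{
		pvDir:    pvDir,
		identity: uuid.NewUUID(), // regenerated on every process start (storage_provisioner.go#L52)
	}
}

func (p *hostPathProvisioner) Delete(volume *core.PersistentVolume) error {
	ann, ok := volume.Annotations["hostPathProvisionerIdentity"]
	if !ok {
		return errors.New("identity annotation not found on PV")
	}
	// After a restart the identity no longer matches what the previous
	// incarnation stamped onto the PV, so deletion is skipped and the PV
	// stays Released -- exactly the "does not match ours" log line above.
	if ann != string(p.identity) {
		return &controller.IgnoredError{Reason: "identity annotation on PV does not match ours"}
	}
	// ... remove the backing hostPath directory and let the PV be deleted ...
	return nil
}

Since the identity is regenerated on every process start, any PV provisioned before a restart fails the comparison, Delete returns an IgnoredError, and the PV is left Released with its data intact.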
Discussed offline, but just for the record: we are not gathering logs from previous iterations of containers running in non-e2e namespaces, so the best I could find was the container statuses from the audit log:
"containerStatuses": [
{
"name": "storage-provisioner",
"state": {
"running": {
"startedAt": "2022-04-20T15:55:13Z"
}
},
"lastState": {
"terminated": {
"exitCode": 255,
"reason": "Error",
"startedAt": "2022-04-20T15:52:05Z",
"finishedAt": "2022-04-20T15:53:49Z",
"containerID": "docker://325016c30b91389a7f4a2f439ee367490cac7763ca11355268ed41cfb90bb8a3"
}
},
"ready": true,
"restartCount": 5,
"image": "gcr.io/k8s-minikube/storage-provisioner:v5",
"imageID": "docker-pullable://gcr.io/k8s-minikube/storage-provisioner@sha256:18eb69d1418e854ad5a19e399310e52808a8321e4c441c1dddad8977a0d7a944",
"containerID": "docker://0912254df14bfb68bce9892c4d12fddb741cf957ed6dcb1df25b665ad2762e25",
"started": true
}
],
I'm sending in a PR (#976) to gather logs from previous iterations and reproduce the issue.
Reproduced in Go #3252
logs_6239.zip e2e_artifacts.tar.lz4
I0428 09:35:16.761321 1 controller.go:1478] delete "pvc-2f1cc092-2e44-41be-a62f-60f0a0210548": volume deletion ignored: ignored because identity annotation on PV does not match ours
I0428 09:35:24.860347 1 volume_store.go:219] persistentvolume "pvc-48d971c5-dcf6-405f-90ff-0267a1340f6c" saved
I0428 09:35:24.868034 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"e2e-test-scyllacluster-8z8qz-9dm9m", Name:"data-basic-2ffbz-us-east-1-us-east-1a-2", UID:"48d971c5-dcf6-405f-90ff-0267a1340f6c", APIVersion:"v1", ResourceVersion:"5277", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-48d971c5-dcf6-405f-90ff-0267a1340f6c
E0428 09:35:27.200117 1 leaderelection.go:361] Failed to update lock: context deadline exceeded
I0428 09:35:27.200341 1 leaderelection.go:278] failed to renew lease kube-system/k8s.io-minikube-hostpath: timed out waiting for the condition
F0428 09:35:27.204318 1 controller.go:913] leaderelection lost
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc000428160, 0x45, 0xb0)
/Users/medya/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x220df80, 0xc000000003, 0x0, 0x0, 0xc000179b20, 0x1baa9bd, 0xd, 0x391, 0x0)
/Users/medya/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x220df80, 0x3, 0x0, 0x0, 0x0, 0x0, 0x1701df5, 0x13, 0x0, 0x0, ...)
/Users/medya/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:751 +0x191
k8s.io/klog/v2.Fatalf(...)
/Users/medya/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1509
sigs.k8s.io/sig-storage-lib-external-provisioner/v6/controller.(*ProvisionController).Run.func2()
/Users/medya/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:913 +0x8f
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000420480)
/Users/medya/go/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:199 +0x29
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000420480, 0x18e5568, 0xc00031ed00)
/Users/medya/go/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:209 +0x167
k8s.io/client-go/tools/leaderelection.RunOrDie(0x18e5568, 0xc000044018, 0x18ed938, 0xc0000c1400, 0x37e11d600, 0x2540be400, 0x77359400, 0xc000296840, 0x17a0ff8, 0x0, ...)
/Users/medya/go/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:222 +0x9f
sigs.k8s.io/sig-storage-lib-external-provisioner/v6/controller.(*ProvisionController).Run(0xc0000c2f00, 0x18e5568, 0xc000044018)
/Users/medya/go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/[email protected]/controller/controller.go:905 +0x49c
k8s.io/minikube/pkg/storage.StartStorageProvisioner(0x1707ad8, 0x19, 0x0, 0x0)
/Users/medya/workspace/minikube/pkg/storage/storage_provisioner.go:142 +0x41d
main.main()
/Users/medya/workspace/minikube/cmd/storage-provisioner/main.go:38 +0x146
The storage-provisioner Pod is restarted because it loses the leader election after failing to renew its lease in time. minikube/storage-provisioner has leader election turned on by default (see: https://github.com/kubernetes/minikube/blob/dfd73aed384e10772f3a70970f487450d0630e74/pkg/storage/storage_provisioner.go#L139 and https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/5fd51cfade8c066f42fc33300c3fe700b0f67cae/controller/controller.go#L212) and I don't see a way to disable it atm.
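For reference, a minimal sketch of the pattern visible in the stack trace, assuming plain client-go leader election: the lease name comes from the "failed to renew lease kube-system/k8s.io-minikube-hostpath" log line, the timeouts match the values passed to RunOrDie in the trace (15s/10s/2s), and the wrapper function and identity string are illustrative rather than minikube's actual code:

package sketch

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, run func(context.Context)) {
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"kube-system",
		"k8s.io-minikube-hostpath", // lease name taken from the log line above
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: "storage-provisioner"},
	)
	if err != nil {
		klog.Fatalf("creating resource lock: %v", err)
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // 0x37e11d600 ns in the RunOrDie frame above
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				// Corresponds to "F0428 ... leaderelection lost": a missed renewal
				// is fatal, the process exits and the kubelet restarts the container.
				klog.Fatalf("leaderelection lost")
			},
		},
	})
}

Because OnStoppedLeading calls klog.Fatalf, a single missed renewal (e.g. an API server hiccup under CI load) kills the process and bumps the restartCount we see on the Pod.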
Similar issue: https://github.com/kubernetes-retired/external-storage/issues/1184
@tnozicka any thoughts on this?
to sum it up here, this is waiting for the new scylla local storage provisioner https://github.com/scylladb/scylla-operator/pull/979#issuecomment-1145676235
https://github.com/scylladb/scylla-operator/actions/runs/3393685823/jobs/5641361477#step:3:1649
We have already switched to the local storage provisioner in https://github.com/scylladb/scylla-operator/pull/1106
https://github.com/scylladb/scylla-operator/blob/5b04e01ca7c5e3320067c4e6545fa74a0da79985/.github/actions/run-e2e/action.yaml#L62-L66
@rzetelskik can we close this?
@tnozicka the original issue should be mitigated now but I believe I've seen this test being flaky quite recently, although most likely for a different reason. To keep the discussion clean I think we could close it anyway and open a separate issue or reopen this one if the e2e fails.
yep, let's open a new one for a different kind of failure