kube-arangodb
Fails at vertical scale-up
I've attempted twice to scale vertically by setting up a new node pool and shifting all pods over to the new pool.
Steps
- Create new node pool in GKE with
gcloud container node-pools create prod --cluster=production --machine-type=n1-standard-2 --local-ssd-count=1 --node-version=1.10.4-gke.2 --num-nodes=3
- Cordon all smaller nodes:
kubectl get nodes; kubectl cordon <node_N>; ...
- Scale up db servers in arango UI from 2 to 4
- Drain smaller nodes one by one with
kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 <node-id>
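The cordon-and-drain sequence above can be sketched as a small shell loop. This is a dry run: the node names are placeholders for the real `kubectl get nodes` output, and the commands are collected and printed rather than executed.

```shell
#!/bin/sh
# Dry-run sketch: cordon every old node first, then drain them one by one.
# Node names below are placeholders, not my real GKE node names.
OLD_NODES="node-a node-b node-c"

PLAN=""
for n in $OLD_NODES; do
  PLAN="$PLAN
kubectl cordon $n"
done
for n in $OLD_NODES; do
  PLAN="$PLAN
kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 $n"
done
# Print the plan; drop the PLAN indirection to actually run the commands.
printf '%s\n' "$PLAN"
```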
Symptoms
This is when the trouble starts: the drain leaves pods unschedulable, and the final node hangs midway through draining, presumably blocked on a crdn (coordinator) or agnt (agent) pod.
Hanging drain command:
node "gke-production-default-pool-1cf7a994-q288" already cordoned
WARNING: Deleting pods with local storage: arangodb-production-agnt-pamhyjwn-2a0a73, arangodb-production-crdn-9tzkulma-2a0a73; Ignoring DaemonSet-managed pods: arangodb-storage-rbx6l, fluentd-gcp-v3.0.0-4wm66, metadata-agent-k62qq
pod "metrics-server-v0.2.1-7486f5bd67-nrn87" evicted
pod "arango-storage-operator-54cd4d8c44-cq6jt" evicted
pod "arango-deployment-operator-797544f86d-c2sd8" evicted
pod "kube-dns-788979dc8f-dqsx6" evicted
pod "arangodb-production-crdn-9tzkulma-2a0a73" evicted
...
Status
kubectl get pods
NAME READY STATUS RESTARTS AGE
arango-deployment-operator-797544f86d-6rg7n 0/1 Running 0 20m
arango-deployment-operator-797544f86d-wx86p 1/1 Running 0 7m
arangodb-production-agnt-8bx3peom-117183 1/1 Running 0 14m
arangodb-production-agnt-pamhyjwn-2a0a73 0/1 Terminating 0 51m
arangodb-production-crdn-x0hbybza-117183 1/1 Running 0 20m
arangodb-production-prmr-bqjwrevu-117183 0/1 Pending 0 10m
arangodb-production-prmr-bqqyrex5-117183 0/1 Pending 0 9m
arangodb-production-prmr-heb9sgby-117183 1/1 Running 0 21m
arangodb-production-prmr-j27vjj72-117183 1/1 Running 0 21m
arangodb-production-prmr-thqajevu-856789 1/1 Running 0 22m
arangodb-production-prmr-vqliboce-117183 0/1 Pending 0 11m
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
arangodb-production-agent-8bx3peom Bound arangodb-storage-86f723-cu1h0npt6aadbh93 8Gi RWO my-local-ssd 15m
arangodb-production-agent-msfwqpeu Bound arangodb-storage-066b2a-wwjd5cjzxjes5mjo 8Gi RWO my-local-ssd 11m
arangodb-production-agent-pamhyjwn Bound arangodb-storage-f317e9-gzyylticflyzongi 8Gi RWO my-local-ssd 52m
arangodb-production-dbserver-bqjwrevu Bound arangodb-storage-70c9a7-avugtee51nmeoome 8Gi RWO my-local-ssd 52m
arangodb-production-dbserver-bqqyrex5 Pending my-local-ssd 10m
arangodb-production-dbserver-heb9sgby Bound arangodb-storage-39286a-9fmwoar6lsaeputi 8Gi RWO my-local-ssd 21m
arangodb-production-dbserver-j27vjj72 Bound arangodb-storage-86f723-sckfgsy0rfxgpg3q 8Gi RWO my-local-ssd 21m
arangodb-production-dbserver-thqajevu Bound arangodb-storage-066b2a-vznsaonaranghf73 8Gi RWO my-local-ssd 22m
arangodb-production-dbserver-vqliboce Bound arangodb-storage-f317e9-uze1q46ijndyilrw 8Gi RWO my-local-ssd 12m
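The Pending PVC above (`arangodb-production-dbserver-bqqyrex5`) seems like the one to inspect first; with a node-local storage class, a claim can only bind to a volume on a schedulable node. A sketch of the checks I'd run (commands are only echoed here, not executed):

```shell
# Dry-run sketch: inspect why a PVC stays Pending.
PVC=arangodb-production-dbserver-bqqyrex5
describe_cmd="kubectl describe pvc $PVC"
events_cmd="kubectl get events --field-selector involvedObject.name=$PVC"
echo "$describe_cmd"
echo "$events_cmd"
```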
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
arangodb-storage-066b2a-vznsaonaranghf73 8Gi RWO Retain Bound default/arangodb-production-dbserver-thqajevu my-local-ssd 22m
arangodb-storage-066b2a-wwjd5cjzxjes5mjo 8Gi RWO Retain Bound default/arangodb-production-agent-msfwqpeu my-local-ssd 10m
arangodb-storage-096d88-8udjn4jp6vgo2h8i 8Gi RWO Retain Available default/arangodb-production-dbserver-dgmynbdw my-local-ssd 15m
arangodb-storage-39286a-9fmwoar6lsaeputi 8Gi RWO Retain Bound default/arangodb-production-dbserver-heb9sgby my-local-ssd 21m
arangodb-storage-70c9a7-avugtee51nmeoome 8Gi RWO Retain Bound default/arangodb-production-dbserver-bqjwrevu my-local-ssd 52m
arangodb-storage-86f723-cu1h0npt6aadbh93 8Gi RWO Retain Bound default/arangodb-production-agent-8bx3peom my-local-ssd 20m
arangodb-storage-86f723-sckfgsy0rfxgpg3q 8Gi RWO Retain Bound default/arangodb-production-dbserver-j27vjj72 my-local-ssd 21m
arangodb-storage-f317e9-gzyylticflyzongi 8Gi RWO Retain Bound default/arangodb-production-agent-pamhyjwn my-local-ssd 52m
arangodb-storage-f317e9-uze1q46ijndyilrw 8Gi RWO Retain Bound default/arangodb-production-dbserver-vqliboce my-local-ssd 11m
Logs
Deployment operator
2018-06-25T08:54:49Z |DEBU| Updating member condition Terminated to true: Pod Failed component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:54:49Z |DEBU| Updating member condition Ready to false component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:54:49Z |DEBU| Updating member condition Terminated to true: Pod Succeeded component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:49Z |DEBU| Updating member condition Ready to false component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:49Z |DEBU| Inspecting agency-serving finalizer component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:49Z |DEBU| Agent data will be gone, so we will check agency serving status first component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:53Z |DEBU| Remaining agents are not health component=deployment deployment=arangodb-production error="Agent http://arangodb-production-agent-msfwqpeu.arangodb-production-int.default.svc:8529 is not responding" pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:53Z |DEBU| Cannot remove finalizer yet component=deployment deployment=arangodb-production error="Agent http://arangodb-production-agent-msfwqpeu.arangodb-production-int.default.svc:8529 is not responding" finalizer=agent.database.arangodb.com/agency-serving pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:53Z |DEBU| Updating member condition Terminated to true: Pod Succeeded component=deployment deployment=arangodb-production pod-name=arangodb-production-crdn-9tzkulma-2a0a73
2018-06-25T08:54:53Z |DEBU| Updating member condition Ready to false component=deployment deployment=arangodb-production pod-name=arangodb-production-crdn-9tzkulma-2a0a73
2018-06-25T08:54:53Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-bqjwrevu-117183
2018-06-25T08:54:53Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-bqqyrex5-117183
2018-06-25T08:54:53Z |DEBU| no memberstatus found for pod component=deployment deployment=arangodb-production pod=arangodb-production-prmr-j27vjj72-117183
2018-06-25T08:54:53Z |DEBU| no memberstatus found for pod component=deployment deployment=arangodb-production pod=arangodb-production-prmr-thqajevu-856789
2018-06-25T08:54:53Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-vqliboce-117183
2018-06-25T08:54:53Z |DEBU| no memberstatus found for PVC component=deployment deployment=arangodb-production pvc=arangodb-production-dbserver-j27vjj72
2018-06-25T08:54:53Z |DEBU| no memberstatus found for PVC component=deployment deployment=arangodb-production pvc=arangodb-production-dbserver-thqajevu
2018-06-25T08:57:04Z |INFO| Event(v1.ObjectReference{Kind:"ArangoDeployment", Namespace:"default", Name:"arangodb-production", UID:"1f50c43a-784f-11e8-ab05-42010aa40092", APIVersion:"database.arangodb.com", ResourceVersion:"8681", FieldPath:""}): type: 'Warning' reason: 'Removed Member Cleanup Failed' Failed to retrieve clusterId node from agency!
2018-06-25T08:57:04Z |DEBU| Cleanup terminated pod component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:57:04Z |DEBU| Cleanup terminated pod component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:04Z |DEBU| Inspecting agency-serving finalizer component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:57:04Z |DEBU| Pod is just being restarted, safe to remove agency serving finalizer component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:57:04Z |DEBU| Inspecting agency-serving finalizer component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:05Z |DEBU| Agent data will be gone, so we will check agency serving status first component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:08Z |DEBU| Remaining agents are not health component=deployment deployment=arangodb-production error="Agent http://arangodb-production-agent-msfwqpeu.arangodb-production-int.default.svc:8529 is not responding" pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:08Z |DEBU| Cannot remove finalizer yet component=deployment deployment=arangodb-production error="Agent http://arangodb-production-agent-msfwqpeu.arangodb-production-int.default.svc:8529 is not responding" finalizer=agent.database.arangodb.com/agency-serving pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:08Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-bqjwrevu-117183
2018-06-25T08:57:08Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-bqqyrex5-117183
2018-06-25T08:57:08Z |DEBU| no memberstatus found for pod component=deployment deployment=arangodb-production pod=arangodb-production-prmr-j27vjj72-117183
2018-06-25T08:57:08Z |DEBU| no memberstatus found for pod component=deployment deployment=arangodb-production pod=arangodb-production-prmr-thqajevu-856789
2018-06-25T08:57:08Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-vqliboce-117183
2018-06-25T08:57:08Z |DEBU| Pod is gone component=deployment deployment=arangodb-production pod-name=arangodb-production-crdn-9tzkulma-2a0a73
2018-06-25T08:57:08Z |INFO| Event(v1.ObjectReference{Kind:"ArangoDeployment", Namespace:"default", Name:"arangodb-production", UID:"1f50c43a-784f-11e8-ab05-42010aa40092", APIVersion:"database.arangodb.com", ResourceVersion:"8681", FieldPath:""}): type: 'Normal' reason: 'Pod Of Coordinator Gone' Pod arangodb-production-crdn-9tzkulma-2a0a73 of member coordinator is gone
2018-06-25T08:57:08Z |DEBU| no memberstatus found for PVC component=deployment deployment=arangodb-production pvc=arangodb-production-dbserver-j27vjj72
2018-06-25T08:57:08Z |DEBU| no memberstatus found for PVC component=deployment deployment=arangodb-production pvc=arangodb-production-dbserver-thqajevu
2018-06-25T08:58:54Z |DEBU| Failed to get number of servers component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T08:58:54Z |DEBU| Cluster inspection failed component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:00:57Z |DEBU| Failed to get number of servers component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:00:57Z |DEBU| Cluster inspection failed component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:03:02Z |DEBU| Failed to get number of servers component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:03:02Z |DEBU| Cluster inspection failed component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:05:08Z |DEBU| Failed to get number of servers component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:05:08Z |DEBU| Cluster inspection failed component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:06:08Z |WARN| Member is not ready for long time, but it is not safe to mark it a failed because: Cannot fetch databases: Get http://arangodb-production.default.svc:8529/_db/_system/_api/database: context deadline exceeded component=deployment deployment=arangodb-production id=PRMR-bqjwrevu role=dbserver
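The repeated "Cannot read from agency" errors look consistent with the agency having lost its quorum: an agency of N agents (a Raft-like group) needs a majority of floor(N/2)+1 healthy members to serve requests, so with 3 agents and only 1 reachable, nothing can be read or written. A minimal illustration of the arithmetic:

```shell
# Quorum arithmetic for a 3-agent agency with only 1 healthy member.
agents=3
healthy=1
quorum=$(( agents / 2 + 1 ))   # majority: 2 of 3
if [ "$healthy" -ge "$quorum" ]; then
  status="agency has quorum"
else
  status="agency lost quorum"
fi
echo "$status"   # prints: agency lost quorum
```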
Storage operator
2018-06-25T08:51:09Z |DEBU| Updated DaemonSet component=storage localStorage=arangodb-storage
2018-06-25T08:52:06Z |DEBU| Created PersistentVolume component=storage local-path=/mnt/disks/ssd0/fjfzudv3jjanzria local-path-root=/mnt/disks/ssd0 localStorage=arangodb-storage name=arangodb-storage-70c9a7-fjfzudv3jjanzria node-name=gke-production-default-pool-1cf7a994-b462
2018-06-25T08:52:06Z |ERRO| Failed to create PersistentVolume component=storage error="PersistentVolumeClaim 'arangodb-production-agent-msfwqpeu' no longer needs a volume" localStorage=arangodb-storage
2018-06-25T08:52:26Z |DEBU| Added PersistentVolume to cleaner component=storage localStorage=arangodb-storage name=arangodb-storage-096d88-lepriuvg0fxkvcpd
2018-06-25T08:52:26Z |DEBU| Cleaning PersistentVolume component=storage localStorage=arangodb-storage name=arangodb-storage-096d88-lepriuvg0fxkvcpd
2018-06-25T08:52:26Z |DEBU| Added PersistentVolume to cleaner component=storage localStorage=arangodb-storage name=arangodb-storage-096d88-lepriuvg0fxkvcpd
2018-06-25T08:52:26Z |DEBU| Cleaning PersistentVolume component=storage localStorage=arangodb-storage name=arangodb-storage-096d88-lepriuvg0fxkvcpd
2018-06-25T08:53:26Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:54:26Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:55:26Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:56:26Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:57:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:58:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:59:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:00:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:01:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:02:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:03:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:04:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:05:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:06:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:07:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:08:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
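"No more nodes available" suggests the storage operator could not find any schedulable node left to place a local PersistentVolume on, which would match every old node being cordoned. One way to confirm (command echoed only; I haven't verified the field selector against this kubectl version):

```shell
# Dry run: list nodes kubectl considers unschedulable (i.e. cordoned).
check_cmd="kubectl get nodes --field-selector spec.unschedulable=true"
echo "$check_cmd"
```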
When I check back a couple of hours later I see this status:
![screen shot 2018-06-25 at 15 19 56](https://user-images.githubusercontent.com/45449/41852399-4bb2dd72-788b-11e8-9c44-0713eb026e68.png)
It has been trying to reschedule and rebalance the cluster, but it completely fails to re-instate the agency. Right now one pod has errored out, one has finished, and one is in normal operation. The one in normal operation reports itself as a FOLLOWER, so it's clearly badly broken.
Could it be a matter of timing? That is, should I have waited patiently for one node to be fully decommissioned, with all its pods successfully rescheduled, before draining the next? That should hardly matter for anything PV(C)-related, but if the agency loses all three existing pods while it is attempting a new consensus with old addresses, I can imagine everything getting out of sync and unable to renegotiate.
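For what it's worth, a timing-safe variant of the drain loop would block after each node until every ArangoDB pod is Ready again before touching the next one. Sketch below as a dry run; the `app=arangodb` label and the use of `kubectl wait` (kubectl 1.11+) are assumptions, not something I've verified against this deployment:

```shell
# Dry-run sketch: drain one node at a time, waiting for the cluster to
# recover in between. Commands are collected and printed, not executed.
NODES="node-a node-b node-c"
PLAN=""
for n in $NODES; do
  PLAN="$PLAN
kubectl drain --ignore-daemonsets --delete-local-data $n
kubectl wait --for=condition=Ready pod -l app=arangodb --timeout=600s"
done
printf '%s\n' "$PLAN"
```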