kube-arangodb

Fails at vertical scale-up

Open • nervetattoo opened this issue 6 years ago • 1 comment

I've attempted twice to scale vertically by setting up a new node pool and shifting all pods over to the new pool.

Steps

  1. Create new node pool in GKE with gcloud container node-pools create prod --cluster=production --machine-type=n1-standard-2 --local-ssd-count=1 --node-version=1.10.4-gke.2 --num-nodes=3
  2. Cordon all smaller nodes: kubectl get nodes; kubectl cordon <node_N>; ...
  3. Scale up the DB servers in the ArangoDB UI from 2 to 4
  4. Drain smaller nodes one by one with kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 <node-id>
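
Step 2 can be scripted instead of cordoning each node by hand. This is a minimal sketch, not a tested procedure. It assumes the smaller nodes all belong to one GKE node pool (the node name in the drain output suggests `default-pool`, but that is an assumption) and that the nodes carry the GKE label `cloud.google.com/gke-nodepool`. The function names `nodes_in_pool` and `cordon_pool` are hypothetical.

```shell
# Print the names of all nodes in the given GKE node pool, one per line.
# Assumption: nodes carry the label cloud.google.com/gke-nodepool=<pool>.
nodes_in_pool() {
  kubectl get nodes -l "cloud.google.com/gke-nodepool=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Cordon every node in the pool so that no new pods are scheduled there.
cordon_pool() {
  nodes_in_pool "$1" | while read -r node; do
    if [ -n "$node" ]; then
      kubectl cordon "$node"
    fi
  done
}

# Usage (pool name is an assumption):
#   cordon_pool default-pool
```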

Symptoms

This is where the trouble starts. Draining leaves pods unschedulable, and the drain of the final node hangs midway, presumably on a crdn or agnt pod.

Hanging drain command:

node "gke-production-default-pool-1cf7a994-q288" already cordoned
WARNING: Deleting pods with local storage: arangodb-production-agnt-pamhyjwn-2a0a73, arangodb-production-crdn-9tzkulma-2a0a73; Ignoring DaemonSet-managed pods: arangodb-storage-rbx6l, fluentd-gcp-v3.0.0-4wm66, metadata-agent-k62qq
pod "metrics-server-v0.2.1-7486f5bd67-nrn87" evicted
pod "arango-storage-operator-54cd4d8c44-cq6jt" evicted
pod "arango-deployment-operator-797544f86d-c2sd8" evicted
pod "kube-dns-788979dc8f-dqsx6" evicted
pod "arangodb-production-crdn-9tzkulma-2a0a73" evicted
...

Status from GKE UI: [screenshot, 2018-06-25 11:04:36]

Status

kubectl get pods

NAME                                          READY     STATUS        RESTARTS   AGE
arango-deployment-operator-797544f86d-6rg7n   0/1       Running       0          20m
arango-deployment-operator-797544f86d-wx86p   1/1       Running       0          7m
arangodb-production-agnt-8bx3peom-117183      1/1       Running       0          14m
arangodb-production-agnt-pamhyjwn-2a0a73      0/1       Terminating   0          51m
arangodb-production-crdn-x0hbybza-117183      1/1       Running       0          20m
arangodb-production-prmr-bqjwrevu-117183      0/1       Pending       0          10m
arangodb-production-prmr-bqqyrex5-117183      0/1       Pending       0          9m
arangodb-production-prmr-heb9sgby-117183      1/1       Running       0          21m
arangodb-production-prmr-j27vjj72-117183      1/1       Running       0          21m
arangodb-production-prmr-thqajevu-856789      1/1       Running       0          22m
arangodb-production-prmr-vqliboce-117183      0/1       Pending       0          11m

kubectl get pvc

NAME                                    STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
arangodb-production-agent-8bx3peom      Bound     arangodb-storage-86f723-cu1h0npt6aadbh93   8Gi        RWO            my-local-ssd   15m
arangodb-production-agent-msfwqpeu      Bound     arangodb-storage-066b2a-wwjd5cjzxjes5mjo   8Gi        RWO            my-local-ssd   11m
arangodb-production-agent-pamhyjwn      Bound     arangodb-storage-f317e9-gzyylticflyzongi   8Gi        RWO            my-local-ssd   52m
arangodb-production-dbserver-bqjwrevu   Bound     arangodb-storage-70c9a7-avugtee51nmeoome   8Gi        RWO            my-local-ssd   52m
arangodb-production-dbserver-bqqyrex5   Pending                                                                        my-local-ssd   10m
arangodb-production-dbserver-heb9sgby   Bound     arangodb-storage-39286a-9fmwoar6lsaeputi   8Gi        RWO            my-local-ssd   21m
arangodb-production-dbserver-j27vjj72   Bound     arangodb-storage-86f723-sckfgsy0rfxgpg3q   8Gi        RWO            my-local-ssd   21m
arangodb-production-dbserver-thqajevu   Bound     arangodb-storage-066b2a-vznsaonaranghf73   8Gi        RWO            my-local-ssd   22m
arangodb-production-dbserver-vqliboce   Bound     arangodb-storage-f317e9-uze1q46ijndyilrw   8Gi        RWO            my-local-ssd   12m

kubectl get pv

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                           STORAGECLASS   REASON    AGE
arangodb-storage-066b2a-vznsaonaranghf73   8Gi        RWO            Retain           Bound       default/arangodb-production-dbserver-thqajevu   my-local-ssd             22m
arangodb-storage-066b2a-wwjd5cjzxjes5mjo   8Gi        RWO            Retain           Bound       default/arangodb-production-agent-msfwqpeu      my-local-ssd             10m
arangodb-storage-096d88-8udjn4jp6vgo2h8i   8Gi        RWO            Retain           Available   default/arangodb-production-dbserver-dgmynbdw   my-local-ssd             15m
arangodb-storage-39286a-9fmwoar6lsaeputi   8Gi        RWO            Retain           Bound       default/arangodb-production-dbserver-heb9sgby   my-local-ssd             21m
arangodb-storage-70c9a7-avugtee51nmeoome   8Gi        RWO            Retain           Bound       default/arangodb-production-dbserver-bqjwrevu   my-local-ssd             52m
arangodb-storage-86f723-cu1h0npt6aadbh93   8Gi        RWO            Retain           Bound       default/arangodb-production-agent-8bx3peom      my-local-ssd             20m
arangodb-storage-86f723-sckfgsy0rfxgpg3q   8Gi        RWO            Retain           Bound       default/arangodb-production-dbserver-j27vjj72   my-local-ssd             21m
arangodb-storage-f317e9-gzyylticflyzongi   8Gi        RWO            Retain           Bound       default/arangodb-production-agent-pamhyjwn      my-local-ssd             52m
arangodb-storage-f317e9-uze1q46ijndyilrw   8Gi        RWO            Retain           Bound       default/arangodb-production-dbserver-vqliboce   my-local-ssd             11m
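
Two things stand out when the listings above are compared by hand: the PVC arangodb-production-dbserver-bqqyrex5 is still Pending, and the PV arangodb-storage-096d88-8udjn4jp6vgo2h8i is Available but references a claim (dbserver-dgmynbdw) that does not appear in the PVC list. The following sketch automates that cross-check. It assumes the output of `kubectl get pvc --no-headers` and `kubectl get pv --no-headers` has been saved to two files, that the REASON column is empty as it is above, and that all claims live in one namespace. The function names are hypothetical.

```shell
# List PVCs that are still waiting for a volume.
# $1: file holding the output of `kubectl get pvc --no-headers`
pending_pvcs() {
  awk '$2 == "Pending" { print $1 }' "$1"
}

# List PVs whose CLAIM column names a PVC that is not in the PVC listing.
# $1: file with `kubectl get pvc --no-headers` (must not be empty)
# $2: file with `kubectl get pv --no-headers`
orphaned_pvs() {
  awk 'NR == FNR { pvc[$1] = 1; next }      # first file: remember PVC names
       { split($6, c, "/")                  # CLAIM is <namespace>/<name>
         if (!(c[2] in pvc)) print $1, $6 }' "$1" "$2"
}

# Usage:
#   kubectl get pvc --no-headers > pvc.txt
#   kubectl get pv  --no-headers > pv.txt
#   pending_pvcs pvc.txt
#   orphaned_pvs pvc.txt pv.txt
```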

Logs

Deployment operator

2018-06-25T08:54:49Z |DEBU| Updating member condition Terminated to true: Pod Failed component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:54:49Z |DEBU| Updating member condition Ready to false component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:54:49Z |DEBU| Updating member condition Terminated to true: Pod Succeeded component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:49Z |DEBU| Updating member condition Ready to false component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:49Z |DEBU| Inspecting agency-serving finalizer component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:49Z |DEBU| Agent data will be gone, so we will check agency serving status first component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:53Z |DEBU| Remaining agents are not health component=deployment deployment=arangodb-production error="Agent http://arangodb-production-agent-msfwqpeu.arangodb-production-int.default.svc:8529 is not responding" pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:53Z |DEBU| Cannot remove finalizer yet component=deployment deployment=arangodb-production error="Agent http://arangodb-production-agent-msfwqpeu.arangodb-production-int.default.svc:8529 is not responding" finalizer=agent.database.arangodb.com/agency-serving pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:54:53Z |DEBU| Updating member condition Terminated to true: Pod Succeeded component=deployment deployment=arangodb-production pod-name=arangodb-production-crdn-9tzkulma-2a0a73
2018-06-25T08:54:53Z |DEBU| Updating member condition Ready to false component=deployment deployment=arangodb-production pod-name=arangodb-production-crdn-9tzkulma-2a0a73
2018-06-25T08:54:53Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-bqjwrevu-117183
2018-06-25T08:54:53Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-bqqyrex5-117183
2018-06-25T08:54:53Z |DEBU| no memberstatus found for pod component=deployment deployment=arangodb-production pod=arangodb-production-prmr-j27vjj72-117183
2018-06-25T08:54:53Z |DEBU| no memberstatus found for pod component=deployment deployment=arangodb-production pod=arangodb-production-prmr-thqajevu-856789
2018-06-25T08:54:53Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-vqliboce-117183
2018-06-25T08:54:53Z |DEBU| no memberstatus found for PVC component=deployment deployment=arangodb-production pvc=arangodb-production-dbserver-j27vjj72
2018-06-25T08:54:53Z |DEBU| no memberstatus found for PVC component=deployment deployment=arangodb-production pvc=arangodb-production-dbserver-thqajevu
2018-06-25T08:57:04Z |INFO| Event(v1.ObjectReference{Kind:"ArangoDeployment", Namespace:"default", Name:"arangodb-production", UID:"1f50c43a-784f-11e8-ab05-42010aa40092", APIVersion:"database.arangodb.com", ResourceVersion:"8681", FieldPath:""}): type: 'Warning' reason: 'Removed Member Cleanup Failed' Failed to retrieve clusterId node from agency!
2018-06-25T08:57:04Z |DEBU| Cleanup terminated pod component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:57:04Z |DEBU| Cleanup terminated pod component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:04Z |DEBU| Inspecting agency-serving finalizer component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:57:04Z |DEBU| Pod is just being restarted, safe to remove agency serving finalizer component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-msfwqpeu-117183
2018-06-25T08:57:04Z |DEBU| Inspecting agency-serving finalizer component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:05Z |DEBU| Agent data will be gone, so we will check agency serving status first component=deployment deployment=arangodb-production pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:08Z |DEBU| Remaining agents are not health component=deployment deployment=arangodb-production error="Agent http://arangodb-production-agent-msfwqpeu.arangodb-production-int.default.svc:8529 is not responding" pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:08Z |DEBU| Cannot remove finalizer yet component=deployment deployment=arangodb-production error="Agent http://arangodb-production-agent-msfwqpeu.arangodb-production-int.default.svc:8529 is not responding" finalizer=agent.database.arangodb.com/agency-serving pod-name=arangodb-production-agnt-pamhyjwn-2a0a73
2018-06-25T08:57:08Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-bqjwrevu-117183
2018-06-25T08:57:08Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-bqqyrex5-117183
2018-06-25T08:57:08Z |DEBU| no memberstatus found for pod component=deployment deployment=arangodb-production pod=arangodb-production-prmr-j27vjj72-117183
2018-06-25T08:57:08Z |DEBU| no memberstatus found for pod component=deployment deployment=arangodb-production pod=arangodb-production-prmr-thqajevu-856789
2018-06-25T08:57:08Z |DEBU| Pod scheduling timeout component=deployment deployment=arangodb-production pod-name=arangodb-production-prmr-vqliboce-117183
2018-06-25T08:57:08Z |DEBU| Pod is gone component=deployment deployment=arangodb-production pod-name=arangodb-production-crdn-9tzkulma-2a0a73
2018-06-25T08:57:08Z |INFO| Event(v1.ObjectReference{Kind:"ArangoDeployment", Namespace:"default", Name:"arangodb-production", UID:"1f50c43a-784f-11e8-ab05-42010aa40092", APIVersion:"database.arangodb.com", ResourceVersion:"8681", FieldPath:""}): type: 'Normal' reason: 'Pod Of Coordinator Gone' Pod arangodb-production-crdn-9tzkulma-2a0a73 of member coordinator is gone
2018-06-25T08:57:08Z |DEBU| no memberstatus found for PVC component=deployment deployment=arangodb-production pvc=arangodb-production-dbserver-j27vjj72
2018-06-25T08:57:08Z |DEBU| no memberstatus found for PVC component=deployment deployment=arangodb-production pvc=arangodb-production-dbserver-thqajevu
2018-06-25T08:58:54Z |DEBU| Failed to get number of servers component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T08:58:54Z |DEBU| Cluster inspection failed component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:00:57Z |DEBU| Failed to get number of servers component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:00:57Z |DEBU| Cluster inspection failed component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:03:02Z |DEBU| Failed to get number of servers component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:03:02Z |DEBU| Cluster inspection failed component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:05:08Z |DEBU| Failed to get number of servers component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:05:08Z |DEBU| Cluster inspection failed component=deployment deployment=arangodb-production error="Cannot read from agency."
2018-06-25T09:06:08Z |WARN| Member is not ready for long time, but it is not safe to mark it a failed because: Cannot fetch databases: Get http://arangodb-production.default.svc:8529/_db/_system/_api/database: context deadline exceeded component=deployment deployment=arangodb-production id=PRMR-bqjwrevu role=dbserver

Storage operator

2018-06-25T08:51:09Z |DEBU| Updated DaemonSet component=storage localStorage=arangodb-storage
2018-06-25T08:52:06Z |DEBU| Created PersistentVolume component=storage local-path=/mnt/disks/ssd0/fjfzudv3jjanzria local-path-root=/mnt/disks/ssd0 localStorage=arangodb-storage name=arangodb-storage-70c9a7-fjfzudv3jjanzria node-name=gke-production-default-pool-1cf7a994-b462
2018-06-25T08:52:06Z |ERRO| Failed to create PersistentVolume component=storage error="PersistentVolumeClaim 'arangodb-production-agent-msfwqpeu' no longer needs a volume" localStorage=arangodb-storage
2018-06-25T08:52:26Z |DEBU| Added PersistentVolume to cleaner component=storage localStorage=arangodb-storage name=arangodb-storage-096d88-lepriuvg0fxkvcpd
2018-06-25T08:52:26Z |DEBU| Cleaning PersistentVolume component=storage localStorage=arangodb-storage name=arangodb-storage-096d88-lepriuvg0fxkvcpd
2018-06-25T08:52:26Z |DEBU| Added PersistentVolume to cleaner component=storage localStorage=arangodb-storage name=arangodb-storage-096d88-lepriuvg0fxkvcpd
2018-06-25T08:52:26Z |DEBU| Cleaning PersistentVolume component=storage localStorage=arangodb-storage name=arangodb-storage-096d88-lepriuvg0fxkvcpd
2018-06-25T08:53:26Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:54:26Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:55:26Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:56:26Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:57:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:58:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T08:59:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:00:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:01:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:02:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:03:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:04:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:05:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:06:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:07:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
2018-06-25T09:08:27Z |ERRO| Failed to create PersistentVolume component=storage error="No more nodes available" localStorage=arangodb-storage
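
Both operator logs repeat the same few errors many times. A small filter that counts lines per `error="..."` value makes the dominant failures easier to see. This is a sketch that assumes the log format shown above. The function name `error_summary` is hypothetical, and the deployment name in the usage line is inferred from the pod name rather than confirmed.

```shell
# Count log lines per error="..." value, most frequent first.
# Reads the files given as arguments, or stdin when none are given.
# Lines without an error="..." field are ignored.
error_summary() {
  sed -n 's/.*error="\([^"]*\)".*/\1/p' "$@" | sort | uniq -c | sort -rn
}

# Usage (deployment name is an assumption):
#   kubectl logs deployment/arango-storage-operator | error_summary
```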

nervetattoo, Jun 25 '18 09:06

When I check back a couple of hours later I see this status:

[screenshot, 2018-06-25 15:19:56]

It has been trying to reschedule and rebalance the cluster, but it completely fails to reinstate the agency. Right now there is 1 errored pod, 1 finished pod, and 1 in normal operation. The one in normal operation is set as a FOLLOWER, so it's clearly badly broken.

Could it be a matter of timing? As in, should I have patiently waited for one node to be fully decommissioned, with all pods successfully rescheduled, before draining the next? That should hardly matter for anything related to PV(C)s, but if the agency loses all 3 existing pods while it's attempting a new consensus with old addresses, I guess it might get completely out of sync and be unable to renegotiate?
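
To test the timing hypothesis, the drain could be made to wait until the pods have settled before it moves on to the next node. This is a minimal, untested sketch. It reuses the drain flags from step 4, reads pod state from the `kubectl get pods` table in the current namespace, and treats any Pending, Terminating, or not-fully-ready Running pod as unsettled. The function names and the `POLL_SECONDS` variable are hypothetical. It would not help if a drain itself hangs, as it did on the final node.

```shell
# Count pods that are Pending, Terminating, or Running but not fully ready.
unsettled_pods() {
  kubectl get pods --no-headers | awk '
    { split($2, r, "/") }                      # READY column is n/m
    $3 == "Pending" || $3 == "Terminating" ||
      ($3 == "Running" && r[1] != r[2]) { n++ }
    END { print n + 0 }'
}

# Drain the given nodes one at a time, waiting for pods to settle in between.
drain_patiently() {
  for node in "$@"; do
    kubectl drain --force --ignore-daemonsets --delete-local-data \
      --grace-period=10 "$node" || return 1
    while [ "$(unsettled_pods)" -gt 0 ]; do
      sleep "${POLL_SECONDS:-30}"
    done
  done
}

# Usage:
#   drain_patiently <node_1> <node_2> <node_3>
```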

nervetattoo, Jun 25 '18 13:06