pangeo-cloud-federation
aws-node NetworkPlugin cni failures on AWS Hubs
For some time we've been noticing network-related errors on AWS hubs when scaling up from zero nodes. This hasn't been too much of a concern because things usually recover on their own. For example, when the first user notebook is launched on the cluster we see the events below (and the user sees these messages while their JupyterHub session is loading). Note in particular FailedCreatePodSandBox followed by SandboxChanged; a quick CNI-side check is sketched after the events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal TriggeredScaleUp 9m26s cluster-autoscaler pod triggered scale-up: [{eksctl-pangeo-nasa-nodegroup-user-notebook-v2-NodeGroup-8VRX95CW3EOF 0->1 (max: 100)}]
Warning FailedScheduling 8m20s (x10 over 9m37s) nasa-staging-user-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
Warning FailedScheduling 8m (x7 over 8m18s) nasa-staging-user-scheduler 0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) were not ready.
Warning FailedScheduling 7m58s nasa-staging-user-scheduler 0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 7m58s nasa-staging-user-scheduler Successfully assigned nasa-staging/jupyter-scottyhq to ip-192-168-57-24.ec2.internal
Warning FailedCreatePodSandBox 7m56s kubelet, ip-192-168-57-24.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "8e556785fa45174bcc4e05721247f4c1cbecf8581b7763f1c34f522d3fac1266" network for pod "jupyter-scottyhq": NetworkPlugin cni failed to set up pod "jupyter-scottyhq_nasa-staging" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "8e556785fa45174bcc4e05721247f4c1cbecf8581b7763f1c34f522d3fac1266" network for pod "jupyter-scottyhq": NetworkPlugin cni failed to teardown pod "jupyter-scottyhq_nasa-staging" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Normal SandboxChanged 6m34s (x8 over 7m55s) kubelet, ip-192-168-57-24.ec2.internal Pod sandbox changed, it will be killed and re-created.
Normal Pulling 6m34s kubelet, ip-192-168-57-24.ec2.internal pulling image "busybox"
Normal Pulled 6m33s kubelet, ip-192-168-57-24.ec2.internal Successfully pulled image "busybox"
Normal Created 6m33s kubelet, ip-192-168-57-24.ec2.internal Created container
Normal Started 6m33s kubelet, ip-192-168-57-24.ec2.internal Started container
Normal Pulling 6m33s kubelet, ip-192-168-57-24.ec2.internal pulling image "pangeo/base-notebook:latest"
Normal Pulled 6m1s kubelet, ip-192-168-57-24.ec2.internal Successfully pulled image "pangeo/base-notebook:latest"
Normal Created 5m10s kubelet, ip-192-168-57-24.ec2.internal Created container
Normal Started 5m10s kubelet, ip-192-168-57-24.ec2.internal Started container
But when the first dask worker pods are launched they get stuck in a Pending state and there is no recovery from the error:
kubectl describe pod dask-scottyhq-cf1fb8da-0tzpk5 -n nasa-staging
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5m1s (x2 over 5m1s) default-scheduler 0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) had taints that the pod didn't tolerate.
Normal TriggeredScaleUp 4m50s cluster-autoscaler pod triggered scale-up: [{eksctl-pangeo-nasa-nodegroup-dask-worker-v2-NodeGroup-7BQ1Z5FRVX6H 0->1 (max: 100)}]
Warning FailedScheduling 3m31s (x4 over 3m51s) default-scheduler 0/3 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 3m31s default-scheduler Successfully assigned nasa-staging/dask-scottyhq-cf1fb8da-0tzpk5 to ip-192-168-55-40.ec2.internal
Warning FailedCreatePodSandBox 3m29s kubelet, ip-192-168-55-40.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "a12c84d5290fdfaeda86834eb552f9690ccfd24501e2e6d8091eeb596aae8b8b" network for pod "dask-scottyhq-cf1fb8da-0tzpk5": NetworkPlugin cni failed to set up pod "dask-scottyhq-cf1fb8da-0tzpk5_nasa-staging" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "a12c84d5290fdfaeda86834eb552f9690ccfd24501e2e6d8091eeb596aae8b8b" network for pod "dask-scottyhq-cf1fb8da-0tzpk5": NetworkPlugin cni failed to teardown pod "dask-scottyhq-cf1fb8da-0tzpk5_nasa-staging" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
A current non-ideal workaround is to shut down the first cluster in the notebook and launch a new one. Since the node is already up at that point, things seem to work.
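From the admin side, an equivalent check (rather than relying on the notebook) is to list anything still stuck in Pending in the hub namespace once the new node reports Ready, and confirm it is the same sandbox error; a rough sketch using the namespace from the output above, with a placeholder pod name:
kubectl get nodes
# any worker pods still stuck after the node is Ready?
kubectl get pods -n nasa-staging --field-selector=status.phase=Pending -o wide
# confirm the stuck pod shows the same FailedCreatePodSandBox events
kubectl describe pod <stuck-dask-worker-pod> -n nasa-staging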
@jacobtomlinson - do you possibly have some insight here?
This might be related: long startup times are partly due to continuous-image-puller being enabled. When new user or dask nodes are added, all of the images in profileList are pulled. I think image pulling happens after the cni errors described above though, so it isn't causing the SandBox errors.
When a new user node launches you can see two image-puller pods (one in each of the staging and prod namespaces) simultaneously pulling all the images, resulting in ~10 min node startup times; a quick check for the per-namespace pullers is sketched after this output:
kubectl get pods --all-namespaces -o wide
nasa-prod continuous-image-puller-mt28p 1/1 Running 0 24m 192.168.6.97 ip-192-168-20-77.ec2.internal <none> <none>
nasa-staging continuous-image-puller-dmxr6 1/1 Running 0 25m 192.168.11.138 ip-192-168-20-77.ec2.internal <none> <none>
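Each hub release (staging and prod) appears to install its own continuous-image-puller DaemonSet, which is why two pullers land on every new node; a quick way to confirm this (sketch):
kubectl get daemonsets --all-namespaces | grep image-puller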
kubectl describe pod continuous-image-puller-mt28p -n nasa-prod
Normal Scheduled 19m default-scheduler Successfully assigned nasa-prod/continuous-image-puller-mt28p to ip-192-168-20-77.ec2.internal
Warning FailedCreatePodSandBox 19m kubelet, ip-192-168-20-77.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "23a039cde942063b50c495d258d9cfbc0e55d25dfc83f8a1b3ddf4588a9e9bbb" network for pod "continuous-image-puller-mt28p": NetworkPlugin cni failed to set up pod "continuous-image-puller-mt28p_nasa-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "23a039cde942063b50c495d258d9cfbc0e55d25dfc83f8a1b3ddf4588a9e9bbb" network for pod "continuous-image-puller-mt28p": NetworkPlugin cni failed to teardown pod "continuous-image-puller-mt28p_nasa-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Normal SandboxChanged 17m (x9 over 19m) kubelet, ip-192-168-20-77.ec2.internal Pod sandbox changed, it will be killed and re-created.
Normal Pulling 17m kubelet, ip-192-168-20-77.ec2.internal pulling image "783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa:e76d5bf"
Normal Pulled 16m kubelet, ip-192-168-20-77.ec2.internal Successfully pulled image "783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa:e76d5bf"
Normal Started 16m kubelet, ip-192-168-20-77.ec2.internal Started container
Normal Created 16m kubelet, ip-192-168-20-77.ec2.internal Created container
Normal Pulling 16m kubelet, ip-192-168-20-77.ec2.internal pulling image "scottyhq/geohackweek2019:latest"
Normal Pulled 15m kubelet, ip-192-168-20-77.ec2.internal Successfully pulled image "scottyhq/geohackweek2019:latest"
Normal Started 15m kubelet, ip-192-168-20-77.ec2.internal Started container
Normal Created 15m kubelet, ip-192-168-20-77.ec2.internal Created container
Normal Pulling 15m kubelet, ip-192-168-20-77.ec2.internal pulling image "scottyhq/pangeo-nasa-asp:2019.07.17"
Normal Pulled 14m kubelet, ip-192-168-20-77.ec2.internal Successfully pulled image "scottyhq/pangeo-nasa-asp:2019.07.17"
Normal Created 14m kubelet, ip-192-168-20-77.ec2.internal Created container
Normal Started 14m kubelet, ip-192-168-20-77.ec2.internal Started container
Normal Pulling 14m kubelet, ip-192-168-20-77.ec2.internal pulling image "pangeo/pangeo-notebook:latest"
Normal Pulled 13m kubelet, ip-192-168-20-77.ec2.internal Successfully pulled image "pangeo/pangeo-notebook:latest"
Normal Created 12m kubelet, ip-192-168-20-77.ec2.internal Created container
Normal Started 12m kubelet, ip-192-168-20-77.ec2.internal Started container
On the node, the docker images output (note the large image sizes):
REPOSITORY TAG IMAGE ID CREATED SIZE
pangeo/pangeo-notebook latest 36ff77c04f9c 2 days ago 5.09GB
783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa e76d5bf 303561b3aba1 11 days ago 4.77GB
pangeo/base-notebook latest e46fad17850c 13 days ago 2.47GB
scottyhq/geohackweek2019 latest 557b13d2ca9f 5 weeks ago 4.28GB
scottyhq/pangeo-nasa-asp 2019.07.17 d4784b6d51f8 7 weeks ago 4.88GB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy v1.13.7 76e6800b6822 3 months ago 80.2MB
602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni v1.5.0 e9b29049f758 3 months ago 247MB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64 3.1 9e462c010bf3 15 months ago 742kB
gcr.io/google_containers/pause 3.0 99e59f495ffa 3 years ago 747kB
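If the continuous pre-puller ends up costing more in node startup time than it saves in user wait time, it can be switched off through the z2jh prePuller.continuous.enabled setting. A minimal sketch for a plain z2jh install (the release name, chart reference, and namespace below are placeholders; on this federation the key likely needs to be nested under the wrapper chart's values instead of set directly):
helm upgrade <release> jupyterhub/jupyterhub \
  --namespace <namespace> \
  --reuse-values \
  --set prePuller.continuous.enabled=false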
This does seem odd. I would be tempted to raise the continuous-image-puller issue upstream in z2jh.
Ideally the node shouldn't be schedulable until the image puller is done. The idea is to avoid 10 workers being created on the node at once, all pulling the image simultaneously. Pods only check for the image at schedule time, and as far as I know there is no effort to avoid duplicate pulls.
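One stopgap in that direction would be to cordon a fresh node until the pullers on it finish, so nothing else schedules while the images download; this is only a rough sketch of the idea, and it assumes the component=continuous-image-puller label that z2jh applies to the puller pods (worth double-checking), with the node name taken from the output above:
# keep new workloads off the node while images are pulled
kubectl cordon ip-192-168-20-77.ec2.internal
# puller pods go Ready once their init containers finish pulling
kubectl wait pod -n nasa-staging -l component=continuous-image-puller \
  --for=condition=Ready --timeout=15m
kubectl uncordon ip-192-168-20-77.ec2.internal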