
NFS issue - Mount failed for NFS V3 even after running rpcBind mount.nfs

Open consideRatio opened this issue 5 years ago • 0 comments

I've tried out Google Filestore with the setup suggested by @yuvipanda, with success! I've also enjoyed the benefit of being able to recover smoothly when my k8s cluster crashed beyond repair while upgrading from 1.11 to 1.12 due to a GKE TPU related issue. Since I didn't have one GCP-PD/PV/PVC per user, this was doable, so thank you all for guiding the path!!

Anyhow, I have run into an issue with the setup that will probably affect you too, as I copied your setup solution. The issue arises for me when autoscaling up in the morning and two user pods start up at the same time on a node that is about to become ready. It works fine if they arrive one at a time, though! I'm not sure exactly what the two pods do concurrently that makes things fail, or when. It seems that when two user pods arrive within a minute of each other, while both are waiting for images etc. to be pulled since the node is freshly created, the issue strikes!

I think I can mitigate most of this issue by making the pods start up quickly, but when it happens I'm forced to drain the node to recover!

This is the error as found in the events of the pods.

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal TriggeredScaleUp 9m13s cluster-autoscaler pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/ds-platform/zones/europe-west4-a/instanceGroups/gke-ds-platform-users-352836a1-grp 0->1 (max: 3)}]

Warning FailedScheduling 8m32s (x25 over 9m37s) jupyterhub-user-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.

Warning FailedMount 7m9s kubelet, gke-ds-platform-users-352836a1-7lb1 MountVolume.SetUp failed for volume "home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 10.64.16.18:/home /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs
Output: Running scope as unit: run-r8fdfd62f64e44eb995557473092b3ab5.scope
Mount failed: Mount failed for NFS V3 even after running rpcBind mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified, exit status 32

Warning FailedMount 7m9s kubelet, gke-ds-platform-users-352836a1-7lb1 MountVolume.SetUp failed for volume "home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 10.64.16.18:/home /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs
Output: Running scope as unit: run-r811263fce8b34ac7a5389196e9458cdc.scope
Mount failed: Mount issued for NFS V3 but unable to run rpcbind:
Output: rpcbind: another rpcbind is already running. Aborting

Hmm, so note that what fails does not relate to what's within the init-container or container, but to the pod's volumes section.

  # From the jupyter-my-user pod's spec (not nested under a specific (init-)container)
  # As generated by the helm chart options `storage.type: static`
  volumes:
  - name: home
    persistentVolumeClaim:
      claimName: home-nfs

Note that this section was created due to:

https://github.com/pangeo-data/pangeo-cloud-federation/blob/2b1804962ad9b3e2df22cb3befec62f4ecd702eb/deployments/dev/config/common.yaml#L14-L18
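
The first FailedMount message itself suggests `-o nolock` as a way to avoid needing rpc.statd. A minimal sketch of what that could look like, assuming the home directory storage is a statically defined NFS PersistentVolume: the `mountOptions` field is a standard part of the PersistentVolume spec, but the PV name and capacity below are assumptions for illustration, and I have not verified that this avoids the failure described above. Note that with `nolock`, file locks are kept local to each node rather than coordinated through the NFS server.

```yaml
# Hypothetical, untested sketch: a static NFS PersistentVolume using the
# `nolock` mount option that the mount.nfs error message suggests.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: home-nfs            # assumed name, matching the volume in the events
spec:
  capacity:
    storage: 1Ti            # placeholder value, not taken from the real setup
  accessModes:
    - ReadWriteMany
  mountOptions:
    - nolock                # keep locks local so rpc.statd is not required
  nfs:
    server: 10.64.16.18     # server address as seen in the mount command
    path: /home             # export path as seen in the mount command
```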

Related

https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/pangeo-deploy/templates/home-storage.yaml
https://github.com/pangeo-data/pangeo-cloud-federation/issues/25
https://github.com/pangeo-data/pangeo-cloud-federation/pull/28

consideRatio avatar Mar 20 '19 09:03 consideRatio