
Problem with permission denied in kubeflow-nfs mode

Open seoha-kim opened this issue 3 years ago • 8 comments

Describe the issue:

Environment:

  • NNI version: 2.2
  • Training service (local|remote|pai|aml|etc): kubeflow
  • Client OS: ubuntu 18.04
  • Server OS (for remote mode only): ubuntu 18.04
  • Python version: 3.6.8
  • PyTorch/TensorFlow version: torch '1.8.1+cu102'
  • Is conda/virtualenv/venv used?: conda
  • Is running in Docker?: yes

Configuration:

  • Experiment config (remember to remove secrets!):

    authorName: default
    experimentName: example_mnist
    trialConcurrency: 2
    maxExecDuration: 1h
    maxTrialNum: 10
    #choice: local, remote, pai, kubeflow
    trainingServicePlatform: kubeflow
    searchSpacePath: search_space.json
    #choice: true, false
    useAnnotation: false
    nniManagerIp: "enp39s0"
    tuner:
      #choice: TPE, Random, Anneal, Evolution
      builtinTunerName: TPE
      classArgs:
        #choice: maximize, minimize
        optimize_mode: maximize
    assessor:
      builtinAssessorName: Medianstop
      classArgs:
        optimize_mode: maximize
    trial:
      codeDir: .
      master:
        replicas: 1
        command: python3 dist_mnist.py
        gpuNum: 2
        cpuNum: 1
        memoryMB: 8196
        image: msranni/nni:latest
    kubeflowConfig:
      operator: pytorch-operator
      apiVersion: v1
      storage: nfs
      nfs:
        server: my-server-ip
        path: my-nfs-mount-path

  • Search space: { "learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}, "momentum":{"_type":"choice","_value":[0.4, 0.5, 0.6]} }

Log message:

  • nnimanager.log:
  • dispatcher.log:
  • nnictl stdout and stderr:

kubectl logs nni-exp-myvgnmol-trial-czlnp-master-0

results

    mkdir: cannot create directory '/tmp/mount/nni/myVgnMoL/CzlNP/code': Permission denied
    mkdir: cannot create directory '/tmp/mount/nni/myVgnMoL/CzlNP/output': Permission denied
    cp: cannot create directory '/tmp/mount/nni/myVgnMoL/CzlNP/code': Permission denied
    /usr/lib/python3/dist-packages/secretstorage/dhcrypto.py:15: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
      from cryptography.utils import int_from_bytes
    /usr/lib/python3/dist-packages/secretstorage/util.py:19: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
      from cryptography.utils import int_from_bytes
    Requirement already up-to-date: nni in /usr/local/lib/python3.6/dist-packages (2.2)
    Requirement already satisfied, skipping upgrade: schema in /usr/local/lib/python3.6/dist-packages (from nni) (0.7.4)
    Requirement already satisfied, skipping upgrade: responses in /usr/local/lib/python3.6/dist-packages (from nni) (0.13.2)
    Requirement already satisfied, skipping upgrade: ruamel.yaml in /usr/local/lib/python3.6/dist-packages (from nni) (0.17.4)
    Requirement already satisfied, skipping upgrade: requests in /usr/local/lib/python3.6/dist-packages (from nni) (2.25.1)
    Requirement already satisfied, skipping upgrade: PythonWebHDFS in /usr/local/lib/python3.6/dist-packages (from nni) (0.2.3)
    Requirement already satisfied, skipping upgrade: filelock in /usr/local/lib/python3.6/dist-packages (from nni) (3.0.12)
    Requirement already satisfied, skipping upgrade: psutil in /usr/local/lib/python3.6/dist-packages (from nni) (5.8.0)
    Requirement already satisfied, skipping upgrade: hyperopt==0.1.2 in /usr/local/lib/python3.6/dist-packages (from nni) (0.1.2)
    Requirement already satisfied, skipping upgrade: scikit-learn>=0.24.1 in /usr/local/lib/python3.6/dist-packages (from nni) (0.24.1)
    Requirement already satisfied, skipping upgrade: websockets in /usr/local/lib/python3.6/dist-packages (from nni) (8.1)
    Requirement already satisfied, skipping upgrade: prettytable in /usr/local/lib/python3.6/dist-packages (from nni) (2.1.0)
    Requirement already satisfied, skipping upgrade: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from nni) (0.8)
    Requirement already satisfied, skipping upgrade: scipy<1.6; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from nni) (1.1.0)
    Requirement already satisfied, skipping upgrade: netifaces in /usr/local/lib/python3.6/dist-packages (from nni) (0.10.9)
    Requirement already satisfied, skipping upgrade: colorama in /usr/local/lib/python3.6/dist-packages (from nni) (0.4.4)
    Requirement already satisfied, skipping upgrade: json-tricks in /usr/local/lib/python3.6/dist-packages (from nni) (3.15.5)
    Requirement already satisfied, skipping upgrade: numpy<1.20; sys_platform != "win32" and python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from nni) (1.18.5)
    Requirement already satisfied, skipping upgrade: astor in /usr/local/lib/python3.6/dist-packages (from nni) (0.8.1)
    Requirement already satisfied, skipping upgrade: contextlib2>=0.5.5 in /usr/local/lib/python3.6/dist-packages (from schema->nni) (0.6.0.post1)
    Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.6/dist-packages (from responses->nni) (1.15.0)
    Requirement already satisfied, skipping upgrade: urllib3>=1.25.10 in /usr/local/lib/python3.6/dist-packages (from responses->nni) (1.26.4)
    Requirement already satisfied, skipping upgrade: ruamel.yaml.clib>=0.1.2; platform_python_implementation == "CPython" and python_version < "3.10" in /usr/local/lib/python3.6/dist-packages (from ruamel.yaml->nni) (0.2.2)
    Requirement already satisfied, skipping upgrade: chardet<5,>=3.0.2 in /usr/lib/python3/dist-packages (from requests->nni) (3.0.4)
    Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /usr/lib/python3/dist-packages (from requests->nni) (2.6)
    Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests->nni) (2018.1.18)
    Requirement already satisfied, skipping upgrade: simplejson in /usr/local/lib/python3.6/dist-packages (from PythonWebHDFS->nni) (3.17.2)
    Requirement already satisfied, skipping upgrade: future in /usr/local/lib/python3.6/dist-packages (from hyperopt==0.1.2->nni) (0.18.2)
    Requirement already satisfied, skipping upgrade: pymongo in /usr/local/lib/python3.6/dist-packages (from hyperopt==0.1.2->nni) (3.11.3)
    Requirement already satisfied, skipping upgrade: networkx in /usr/local/lib/python3.6/dist-packages (from hyperopt==0.1.2->nni) (2.5.1)
    Requirement already satisfied, skipping upgrade: tqdm in /usr/local/lib/python3.6/dist-packages (from hyperopt==0.1.2->nni) (4.60.0)
    Requirement already satisfied, skipping upgrade: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.24.1->nni) (1.0.1)
    Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.24.1->nni) (2.1.0)
    Requirement already satisfied, skipping upgrade: wcwidth in /usr/local/lib/python3.6/dist-packages (from prettytable->nni) (0.2.5)
    Requirement already satisfied, skipping upgrade: importlib-metadata; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from prettytable->nni) (4.0.1)
    Requirement already satisfied, skipping upgrade: decorator<5,>=4.3 in /usr/local/lib/python3.6/dist-packages (from networkx->hyperopt==0.1.2->nni) (4.4.2)
    Requirement already satisfied, skipping upgrade: typing-extensions>=3.6.4; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from importlib-metadata; python_version < "3.8"->prettytable->nni) (3.7.4.3)
    Requirement already satisfied, skipping upgrade: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata; python_version < "3.8"->prettytable->nni) (3.4.1)
    WARNING: You are using pip version 20.2.4; however, version 21.1.2 is available.
    You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
    /tmp/mount/nni/myVgnMoL/CzlNP/run_master.sh: 15: cd: can't cd to /tmp/mount/nni/myVgnMoL/CzlNP/code
    /tmp/mount/nni/myVgnMoL/CzlNP/run_master.sh: 16: /tmp/mount/nni/myVgnMoL/CzlNP/run_master.sh: cannot create /tmp/mount/nni/myVgnMoL/CzlNP/output/master_output/trialkeeper_stdout: Directory nonexistent

I tried both the root_squash and no_root_squash options on the NFS server, but the permission error persists. I also tried chmod -R 777 on the NFS folder and on /tmp/mount/nni, as well as chmod -R 700, but the result was the same.

Sometimes the following error appears instead: sh: 0: Can't open /tmp/mount/nni/UZRxHv9l/XRNoa/run_master.sh (and there is no /tmp/mount/nni/UZRxHv9I folder)

How do I solve this?

How to reproduce it?:

seoha-kim avatar May 25 '21 09:05 seoha-kim

NNI does not have permission to access the /tmp/mount folder. How did you mount your NFS server? I am not sure whether this is a user account/group issue. Did you try creating a file under the folder manually?

SparkSnail avatar May 28 '21 09:05 SparkSnail

Yes, I created the /tmp/mount folder manually. I mounted the NFS server with the command mount (nfs server ip):/(nfs_folder) /(mounted folder), and the export uses the rw, sync, insecure, no_subtree_check, root_squash options.

seoha-kim avatar May 29 '21 03:05 seoha-kim

It is not mounted at /tmp/mount but at the /home/user/data folder. Should I mount it at the /tmp/mount folder instead?

seoha-kim avatar May 29 '21 03:05 seoha-kim

I tried to mount the /tmp/mount folder again, but the same error appears. @SparkSnail

ghost avatar May 31 '21 02:05 ghost

I see. The /tmp/mount folder is created inside the container by Kubernetes, and the NFS path is mounted into the container through volumeMounts. I suggest submitting a Kubeflow job that sets the NFS volumeMounts without NNI to debug this, and checking whether the mounted folder in the container has read/write permission. I am not sure whether this is an NFS server issue.
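As a rough illustration of the read/write check suggested above, here is a minimal shell sketch that could be run inside the container. The function name `probe_rw` and the probe directory name are hypothetical, not part of NNI. It prints the current user and the directory's ownership, which matters because with root_squash the NFS server maps requests from uid 0 to an anonymous user, so a process running as root in the container may still be refused.

```shell
#!/bin/sh
# Hypothetical helper (not part of NNI): check whether the current user
# can create a directory and a file under the given mount point.
probe_rw() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "NOT_A_DIRECTORY $dir"
        return 1
    fi
    # Show who we are and who owns the mount point; under root_squash
    # the server treats uid 0 as an unprivileged anonymous user.
    id
    ls -ld "$dir"
    probe="$dir/.nni_rw_probe_$$"
    # Mimic what the trial start script does: mkdir, then create a file.
    if mkdir -p "$probe" && touch "$probe/file"; then
        rm -rf "$probe"
        echo "WRITE_OK $dir"
        return 0
    fi
    echo "WRITE_FAILED $dir"
    return 1
}
```

Calling `probe_rw /tmp/mount` in the trial container (or in a plain pod that mounts the same NFS export) should show whether the same `mkdir` that fails in the trial log also fails outside NNI.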

SparkSnail avatar May 31 '21 03:05 SparkSnail

[image: current /etc/exports options on the NFS server]

[images: results of ls -al on the NFS server]

[images: results of ls -al on the NFS client]

Is there any problem?

I set up NFS like this:

[at nfs server]

    sudo apt install nfs-common
    sudo apt install nfs-kernel-server
    sudo mkdir /home/plask/nfsroot

    sudo vi /etc/exports
    # add this line: /home/plask/nfsroot *(rw,sync, insecure, root_squash,no_subtree_check)
    sudo service nfs-kernel-server restart

    sudo systemctl enable rpcbind
    sudo systemctl enable nfs-server
    sudo systemctl start rpcbind
    sudo systemctl start nfs-server

    sudo exportfs -a
    # 192.168.1.198 is the NFS client IP
    sudo ufw allow from 192.168.1.198 to any port nfs

[at nfs client]

    sudo apt-get install nfs-common
    sudo mkdir -p /home/plask/vibeData
    sudo mount 192.168.1.200:/home/plask/nfsroot /home/plask/vibeData

    helm repo add raphael https://raphaelmonrouzeau.github.io/charts/repository/
    helm repo update
    helm install nfs-provisioner \
      --set nfs.server=192.168.1.200 \
      --set nfs.path=/home/plask/nfsroot \
      --set storageClass.defaultClass=true \
      --set storageClass.name=nfs-provisioner \
      raphael/nfs-server-provisioner

@SparkSnail

ghost avatar Jun 03 '21 08:06 ghost

I didn't find any issue in your NFS commands. I think you should check whether Kubernetes can successfully mount your NFS server into a container using NFS volumeMounts. https://kubernetes.io/docs/concepts/storage/volumes/#nfs
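One way to try this without NNI is a throwaway pod that mounts the export at /tmp/mount and attempts a write. The sketch below only prints a manifest. The function name `render_nfs_debug_pod`, the pod name `nfs-debug`, and the `busybox` image are assumptions chosen for illustration, while the `nfs` volume fields (`server`, `path`) follow the Kubernetes volume documentation linked above.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal Pod manifest that mounts an NFS
# export at /tmp/mount and tries to create a directory and a file there.
# Arguments: NFS server address, exported path.
render_nfs_debug_pod() {
    server="$1"
    path="$2"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nfs-debug
spec:
  restartPolicy: Never
  containers:
    - name: probe
      image: busybox
      command: ["sh", "-c", "id; ls -ld /tmp/mount; mkdir -p /tmp/mount/nni-debug && touch /tmp/mount/nni-debug/probe && echo WRITE_OK"]
      volumeMounts:
        - name: nfs-volume
          mountPath: /tmp/mount
  volumes:
    - name: nfs-volume
      nfs:
        server: $server
        path: $path
EOF
}
```

With the addresses from this thread, `render_nfs_debug_pod 192.168.1.200 /home/plask/nfsroot | kubectl apply -f -` followed by `kubectl logs nfs-debug` would show whether the write succeeds. If it fails here too, the problem is likely in the NFS export or its permissions rather than in NNI.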

SparkSnail avatar Jun 07 '21 06:06 SparkSnail

sh: 0: Can't open /tmp/mount/nni/UZRxHv9l/XRNoa/run_master.sh

I ran into the same problem. Have you solved it?

N-Kingsley avatar Jul 20 '22 06:07 N-Kingsley