nni
Problem with permission denied in kubeflow-nfs mode
Describe the issue:
Environment:
- NNI version: 2.2
- Training service (local|remote|pai|aml|etc): kubeflow
- Client OS: ubuntu 18.04
- Server OS (for remote mode only): ubuntu 18.04
- Python version: 3.6.8
- PyTorch/TensorFlow version: torch '1.8.1+cu102'
- Is conda/virtualenv/venv used?: conda
- Is running in Docker?: yes
Configuration:
- Experiment config (remember to remove secrets!):
authorName: default
experimentName: example_mnist
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
nniManagerIp: "enp39s0"
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  master:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 2
    cpuNum: 1
    memoryMB: 8196
    image: msranni/nni:latest
kubeflowConfig:
  operator: pytorch-operator
  apiVersion: v1
  storage: nfs
  nfs:
    server: my-server-ip
    path: my-nfs-mount-path
- Search space: { "learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}, "momentum":{"_type":"choice","_value":[0.4, 0.5, 0.6]} }
Log message:
- nnimanager.log:
- dispatcher.log:
- nnictl stdout and stderr:
kubectl logs nni-exp-myvgnmol-trial-czlnp-master-0
results
mkdir: cannot create directory '/tmp/mount/nni/myVgnMoL/CzlNP/code': Permission denied
mkdir: cannot create directory '/tmp/mount/nni/myVgnMoL/CzlNP/output': Permission denied
cp: cannot create directory '/tmp/mount/nni/myVgnMoL/CzlNP/code': Permission denied
/usr/lib/python3/dist-packages/secretstorage/dhcrypto.py:15: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
/usr/lib/python3/dist-packages/secretstorage/util.py:19: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
Requirement already up-to-date: nni in /usr/local/lib/python3.6/dist-packages (2.2)
Requirement already satisfied, skipping upgrade: schema in /usr/local/lib/python3.6/dist-packages (from nni) (0.7.4)
Requirement already satisfied, skipping upgrade: responses in /usr/local/lib/python3.6/dist-packages (from nni) (0.13.2)
Requirement already satisfied, skipping upgrade: ruamel.yaml in /usr/local/lib/python3.6/dist-packages (from nni) (0.17.4)
Requirement already satisfied, skipping upgrade: requests in /usr/local/lib/python3.6/dist-packages (from nni) (2.25.1)
Requirement already satisfied, skipping upgrade: PythonWebHDFS in /usr/local/lib/python3.6/dist-packages (from nni) (0.2.3)
Requirement already satisfied, skipping upgrade: filelock in /usr/local/lib/python3.6/dist-packages (from nni) (3.0.12)
Requirement already satisfied, skipping upgrade: psutil in /usr/local/lib/python3.6/dist-packages (from nni) (5.8.0)
Requirement already satisfied, skipping upgrade: hyperopt==0.1.2 in /usr/local/lib/python3.6/dist-packages (from nni) (0.1.2)
Requirement already satisfied, skipping upgrade: scikit-learn>=0.24.1 in /usr/local/lib/python3.6/dist-packages (from nni) (0.24.1)
Requirement already satisfied, skipping upgrade: websockets in /usr/local/lib/python3.6/dist-packages (from nni) (8.1)
Requirement already satisfied, skipping upgrade: prettytable in /usr/local/lib/python3.6/dist-packages (from nni) (2.1.0)
Requirement already satisfied, skipping upgrade: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from nni) (0.8)
Requirement already satisfied, skipping upgrade: scipy<1.6; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from nni) (1.1.0)
Requirement already satisfied, skipping upgrade: netifaces in /usr/local/lib/python3.6/dist-packages (from nni) (0.10.9)
Requirement already satisfied, skipping upgrade: colorama in /usr/local/lib/python3.6/dist-packages (from nni) (0.4.4)
Requirement already satisfied, skipping upgrade: json-tricks in /usr/local/lib/python3.6/dist-packages (from nni) (3.15.5)
Requirement already satisfied, skipping upgrade: numpy<1.20; sys_platform != "win32" and python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from nni) (1.18.5)
Requirement already satisfied, skipping upgrade: astor in /usr/local/lib/python3.6/dist-packages (from nni) (0.8.1)
Requirement already satisfied, skipping upgrade: contextlib2>=0.5.5 in /usr/local/lib/python3.6/dist-packages (from schema->nni) (0.6.0.post1)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.6/dist-packages (from responses->nni) (1.15.0)
Requirement already satisfied, skipping upgrade: urllib3>=1.25.10 in /usr/local/lib/python3.6/dist-packages (from responses->nni) (1.26.4)
Requirement already satisfied, skipping upgrade: ruamel.yaml.clib>=0.1.2; platform_python_implementation == "CPython" and python_version < "3.10" in /usr/local/lib/python3.6/dist-packages (from ruamel.yaml->nni) (0.2.2)
Requirement already satisfied, skipping upgrade: chardet<5,>=3.0.2 in /usr/lib/python3/dist-packages (from requests->nni) (3.0.4)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /usr/lib/python3/dist-packages (from requests->nni) (2.6)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests->nni) (2018.1.18)
Requirement already satisfied, skipping upgrade: simplejson in /usr/local/lib/python3.6/dist-packages (from PythonWebHDFS->nni) (3.17.2)
Requirement already satisfied, skipping upgrade: future in /usr/local/lib/python3.6/dist-packages (from hyperopt==0.1.2->nni) (0.18.2)
Requirement already satisfied, skipping upgrade: pymongo in /usr/local/lib/python3.6/dist-packages (from hyperopt==0.1.2->nni) (3.11.3)
Requirement already satisfied, skipping upgrade: networkx in /usr/local/lib/python3.6/dist-packages (from hyperopt==0.1.2->nni) (2.5.1)
Requirement already satisfied, skipping upgrade: tqdm in /usr/local/lib/python3.6/dist-packages (from hyperopt==0.1.2->nni) (4.60.0)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.24.1->nni) (1.0.1)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.24.1->nni) (2.1.0)
Requirement already satisfied, skipping upgrade: wcwidth in /usr/local/lib/python3.6/dist-packages (from prettytable->nni) (0.2.5)
Requirement already satisfied, skipping upgrade: importlib-metadata; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from prettytable->nni) (4.0.1)
Requirement already satisfied, skipping upgrade: decorator<5,>=4.3 in /usr/local/lib/python3.6/dist-packages (from networkx->hyperopt==0.1.2->nni) (4.4.2)
Requirement already satisfied, skipping upgrade: typing-extensions>=3.6.4; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from importlib-metadata; python_version < "3.8"->prettytable->nni) (3.7.4.3)
Requirement already satisfied, skipping upgrade: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata; python_version < "3.8"->prettytable->nni) (3.4.1)
WARNING: You are using pip version 20.2.4; however, version 21.1.2 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
/tmp/mount/nni/myVgnMoL/CzlNP/run_master.sh: 15: cd: can't cd to /tmp/mount/nni/myVgnMoL/CzlNP/code
/tmp/mount/nni/myVgnMoL/CzlNP/run_master.sh: 16: /tmp/mount/nni/myVgnMoL/CzlNP/run_master.sh: cannot create /tmp/mount/nni/myVgnMoL/CzlNP/output/master_output/trialkeeper_stdout: Directory nonexistent
I tried both the root_squash and no_root_squash options on the NFS server, but the permission error persists. I also tried chmod -R 777 (and chmod -R 700) on the NFS folder and on /tmp/mount/nni, but the result was the same.
Sometimes the following error is shown instead: sh: 0: Can't open /tmp/mount/nni/UZRxHv9l/XRNoa/run_master.sh (and there is no /tmp/mount/nni/UZRxHv9I folder)
How do I solve this?
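For readers comparing the root_squash and chmod attempts above, here is a minimal sketch of one check that can be run on the NFS server against the exported directory. It assumes that root_squash maps the container's root user to the anonymous uid/gid (commonly 65534 on Linux unless anonuid/anongid are set in /etc/exports). The function name `writable_by_squashed_root` is hypothetical, and the check only looks at the permission bits of the directory itself; it ignores ACLs, parent directories, and any server-side security modules.

```python
import os
import stat
import tempfile


def writable_by_squashed_root(path, anon_uid=65534, anon_gid=65534):
    """Return True if the permission bits of `path` let the anonymous
    NFS user create entries in it (write + search permission).

    Assumption: 65534 is the anonymous uid/gid; adjust if the export
    sets anonuid/anongid explicitly.
    """
    st = os.stat(path)
    if st.st_uid == anon_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR   # owner bits apply
    elif st.st_gid == anon_gid:
        needed = stat.S_IWGRP | stat.S_IXGRP   # group bits apply
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH   # "other" bits apply
    return (st.st_mode & needed) == needed


if __name__ == "__main__":
    # Demo on a throwaway directory; point it at the export path in practice.
    with tempfile.TemporaryDirectory() as demo_dir:
        os.chmod(demo_dir, 0o700)
        print("0700:", writable_by_squashed_root(demo_dir))
        os.chmod(demo_dir, 0o777)
        print("0777:", writable_by_squashed_root(demo_dir))
```

If this returns True for the export and the trial pod still reports "Permission denied", the cause is probably elsewhere (for example, the pod may not be mounting the path it is expected to mount).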
How to reproduce it?:
NNI does not have permission to access the /tmp/mount folder. How do you mount your NFS server? I'm not sure whether this is a user account/group issue; did you try creating a file under the folder manually?
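The manual check suggested here can be scripted. The sketch below tries to create and remove a directory and a file under a given mount point, roughly mirroring the mkdir and cp steps that fail in the trial log, and reports the uid/gid it ran as. The function name `probe_write` is hypothetical, and the path in the demo is only the one mentioned in this thread.

```python
import os
import tempfile


def probe_write(mount_dir):
    """Try to create and remove a directory and a file under `mount_dir`.

    Returns (ok, message). On failure the message carries the OS error,
    e.g. "Permission denied". A partially created probe directory may be
    left behind if a later step fails.
    """
    who = f"uid={os.getuid()} gid={os.getgid()}"
    try:
        work = tempfile.mkdtemp(prefix="rw-check-", dir=mount_dir)
        target = os.path.join(work, "probe.txt")
        with open(target, "w") as fh:
            fh.write("ok\n")
        os.remove(target)
        os.rmdir(work)
    except OSError as exc:
        return False, f"{who} cannot write to {mount_dir}: {exc}"
    return True, f"{who} can write to {mount_dir}"


if __name__ == "__main__":
    ok, message = probe_write("/tmp/mount")
    print(message)
```

Running it both on the NFS client host and inside the trial container image shows whether the failure is specific to the user the container runs as.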
Yes, I created the /tmp/mount folder manually. I mounted the NFS server with the command mount (nfs server ip):/(nfs_folder) /(mounted folder), with the rw, sync, insecure, no_subtree_check, root_squash options.
It is not mounted at /tmp/mount but at the /home/user/data folder. Should I mount it at the /tmp/mount folder instead?
I tried mounting at the /tmp/mount folder as well, but the same error appears. @SparkSnail
I see. This /tmp/mount folder is created inside the container by Kubernetes, and the NFS path is mounted into the container through volumeMounts.
I suggest submitting a Kubeflow job that sets NFS volumeMounts without NNI to debug, and checking whether the mounted folder in the container has read/write permission. I'm not sure whether this is an NFS server issue.
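One way to try this suggestion is sketched below. It swaps the Kubeflow job for a plain Kubernetes Pod, which is simpler and exercises the same NFS volume and volumeMounts mechanism. The script builds a Pod manifest as JSON; the pod name `nfs-rw-check`, the `busybox` image, and the probe command are illustrative assumptions, and the server and path values are the placeholders from the experiment config above.

```python
import json


def nfs_debug_pod(server, path, mount_path="/tmp/mount", name="nfs-rw-check"):
    """Build a Pod manifest that mounts an NFS export and tries to write to it."""
    # Print the container's uid/gid, then attempt the same kind of
    # mkdir that fails in the trial log.
    probe = (
        f"id && mkdir -p {mount_path}/rw-check "
        f"&& touch {mount_path}/rw-check/probe "
        f"&& ls -ld {mount_path}/rw-check"
    )
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "probe",
                    "image": "busybox",  # assumption: any small image with sh
                    "command": ["sh", "-c", probe],
                    "volumeMounts": [{"name": "nfs", "mountPath": mount_path}],
                }
            ],
            "volumes": [
                {"name": "nfs", "nfs": {"server": server, "path": path}}
            ],
        },
    }


if __name__ == "__main__":
    # Replace the placeholders with the real NFS server and export path.
    manifest = nfs_debug_pod("my-server-ip", "my-nfs-mount-path")
    print(json.dumps(manifest, indent=2))
```

The output can be piped to `kubectl apply -f -`, and `kubectl logs nfs-rw-check` then shows either the directory listing or the same "Permission denied" error, independent of NNI.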
The current /etc/exports options on the NFS server
and the results of ls -al on the NFS server are as below
The results of ls -al on the NFS client are as below
Is there any problem?
I set up and mounted NFS like this: `[at nfs server]
sudo apt install nfs-common
sudo apt install nfs-kernel-server
sudo mkdir /home/plask/nfsroot

sudo vi /etc/exports
# add this line: /home/plask/nfsroot *(rw,sync, insecure, root_squash,no_subtree_check)
sudo service nfs-kernel-server restart

sudo systemctl enable rpcbind
sudo systemctl enable nfs-server
sudo systemctl start rpcbind
sudo systemctl start nfs-server

sudo exportfs -a
sudo ufw allow from 192.168.1.198 to any port nfs  # 192.168.1.198 is the NFS client IP

[at nfs client]
sudo apt-get install nfs-common
sudo mkdir -p /home/plask/vibeData
sudo mount 192.168.1.200:/home/plask/nfsroot /home/plask/vibeData

helm repo add raphael https://raphaelmonrouzeau.github.io/charts/repository/
helm repo update
helm install nfs-provisioner \
  --set nfs.server=192.168.1.200 \
  --set nfs.path=/home/plask/nfsroot \
  --set storageClass.defaultClass=true \
  --set storageClass.name=nfs-provisioner \
  raphael/nfs-server-provisioner`
@SparkSnail
I didn't find any issue in your NFS commands. I think you should check whether Kubernetes can successfully mount your NFS server into a container using NFS volumeMounts: https://kubernetes.io/docs/concepts/storage/volumes/#nfs
sh: 0: Can't open /tmp/mount/nni/UZRxHv9l/XRNoa/run_master.sh
I hit the same problem. Have you solved it?