Having trouble running NNI with frameworkcontroller on k8s
Describe the issue: When I tried NNI with frameworkcontroller on k8s, I used the YAML files below.
- I tried NFS storage.

For the NNI config, config_framework.yml:
authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 3
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: 192.168.1.106
    # Your NFS server export path, like /var/nfs/nni
    path: /home/mj_lee/mount
  serviceAccountName: frameworkcontroller
And for the frameworkcontroller StatefulSet, frameworkcontroller-with-default-config.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
        - name: KUBECONFIG
          value: ~/.kube/config
I then applied the StatefulSet:

kubectl apply -f frameworkcontroller-with-default-config.yaml

and frameworkcontroller-0 went to Running.

Next I created the experiment:

nnictl create --config config_framework.yml

A new experiment worker pod was created, but it failed to run.
I checked the logs with kubectl logs nniexp~.

Then I checked the NFS mount directory: there is no nni directory, but it does contain an envs directory and a run.sh file.

I think it should create nni/<experiment_id>/run.sh in the mount folder.
Here is the kubectl describe output for the nniexp-worker-0 pod:
Name: nniexpr2ys5f9aenvzchoa-worker-0
Namespace: default
Priority: 0
Node: zerooneai-p210908-4/192.168.1.104
Start Time: Fri, 25 Feb 2022 14:33:07 +0900
Labels: FC_FRAMEWORK_NAME=nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME=worker
FC_TASK_INDEX=0
Annotations: FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_TASKROLE_NAME: worker
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_ATTEMPT_ID: 0
FC_TASK_INDEX: 0
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
cni.projectcalico.org/podIP: 10.0.243.33/32
cni.projectcalico.org/podIPs: 10.0.243.33/32
Status: Running
IP: 10.0.243.33
IPs:
IP: 10.0.243.33
Controlled By: ConfigMap/nniexpr2ys5f9aenvzchoa-attempt
Init Containers:
frameworkbarrier:
Container ID: docker://b05885b647cdb41dba4587f6f93eeb5bd19a390641687012bc017d73cc21aa79
Image: frameworkcontroller/frameworkbarrier
Image ID: docker-pullable://frameworkcontroller/frameworkbarrier@sha256:9d95e31152460e3cc5c7ad2b09738c1fdb540ff7a50abc72b2f8f9d0badb87da
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 25 Feb 2022 14:33:12 +0900
Finished: Fri, 25 Feb 2022 14:33:22 +0900
Ready: True
Restart Count: 0
Environment:
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME: worker
FC_TASK_INDEX: 0
FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
FC_TASK_ATTEMPT_ID: 0
FC_POD_UID: (v1:metadata.uid)
FC_TASK_ATTEMPT_INSTANCE_UID: 0_$(FC_POD_UID)
Mounts:
/mnt/frameworkbarrier from frameworkbarrier-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Containers:
framework:
Container ID: docker://dc9ff6579c67e8bc394c734e8add70fbdb581d014541044d1877a9e5d888f828
Image: msranni/nni:latest
Image ID: docker-pullable://msranni/nni@sha256:8985fb134204ef523e113ac4a572ae7460cd246a5ff471df413f7d17dd917cd1
Port: 4000/TCP
Host Port: 0/TCP
Command:
sh
/tmp/mount/nni/r2ys5f9a/run.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: sh: 0: Can't open /tmp/mount/nni/r2ys5f9a/run.sh
Exit Code: 127
Started: Fri, 25 Feb 2022 14:36:43 +0900
Finished: Fri, 25 Feb 2022 14:36:43 +0900
Ready: False
Restart Count: 5
Limits:
cpu: 1
memory: 8Gi
Requests:
cpu: 1
memory: 8Gi
Environment:
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME: worker
FC_TASK_INDEX: 0
FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
FC_TASK_ATTEMPT_ID: 0
FC_POD_UID: (v1:metadata.uid)
FC_TASK_ATTEMPT_INSTANCE_UID: 0_$(FC_POD_UID)
Mounts:
/mnt/frameworkbarrier from frameworkbarrier-volume (rw)
/tmp/mount from nni-vol (rw)
/var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nni-vol:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 192.168.1.106
Path: /home/zerooneai/mj_lee/mount
ReadOnly: false
frameworkbarrier-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
frameworkcontroller-token-7sw6q:
Type: Secret (a volume populated by a Secret)
SecretName: frameworkcontroller-token-7sw6q
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m19s default-scheduler Successfully assigned default/nniexpr2ys5f9aenvzchoa-worker-0 to zerooneai-p210908-4
Normal Pulling 6m18s kubelet Pulling image "frameworkcontroller/frameworkbarrier"
Normal Pulled 6m15s kubelet Successfully pulled image "frameworkcontroller/frameworkbarrier" in 3.364620261s
Normal Created 6m14s kubelet Created container frameworkbarrier
Normal Started 6m14s kubelet Started container frameworkbarrier
Normal Pulled 6m1s kubelet Successfully pulled image "msranni/nni:latest" in 2.375328373s
Normal Pulled 5m56s kubelet Successfully pulled image "msranni/nni:latest" in 4.709013579s
Normal Pulled 5m36s kubelet Successfully pulled image "msranni/nni:latest" in 2.373976028s
Normal Pulling 5m9s (x4 over 6m4s) kubelet Pulling image "msranni/nni:latest"
Normal Created 5m7s (x4 over 6m1s) kubelet Created container framework
Normal Pulled 5m7s kubelet Successfully pulled image "msranni/nni:latest" in 2.484752039s
Normal Started 5m6s (x4 over 6m1s) kubelet Started container framework
Warning BackOff 71s (x22 over 5m54s) kubelet Back-off restarting failed container
Please let me know how to solve this. Thanks!
Environment:
- NNI version: 2.6
- Training service (local|remote|pai|aml|etc): frameworkcontroller
- Client OS: ubuntu 18.04
- Server OS (for remote mode only):
- Python version: 3.6.9
- PyTorch/TensorFlow version: 1.10.1+cu102
Hi, do we have a solution for this?
I faced a similar issue (with the Kubeflow training service) and fixed it by hacking trialDispatcher.ts, kubeflowEnvironmentService.ts, and kubernetesEnvironmentService.ts.

Suppose the experiment id is ABCDE. The reason this happens is that the entry point ("run.sh") and envs (which contains the execution environment) are uploaded to the root path of the NFS server, but the pod's start command is

sh /tmp/mount/nni/ABCDE/run.sh && ...

so it raises a "can't open /tmp/mount/nni/ABCDE/run.sh" error.

P.S. In the container, the NFS path is mounted at /tmp/mount.

P.P.S. When the trial concurrency is larger than 1, run.sh will be overwritten by other environments.

I'll create a PR ASAP to fix this issue.

Related Issues: https://github.com/microsoft/frameworkcontroller/issues/75, https://github.com/microsoft/nni/issues/4874, https://github.com/microsoft/nni/issues/5026.
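To restate the mismatch concretely (illustration only, not NNI code; it uses the placeholder experiment id ABCDE and the mount point from the comment above):

CONTAINER_MOUNT = "/tmp/mount"                            # NFS root mounted inside the trial pod
expected_script = CONTAINER_MOUNT + "/nni/ABCDE/run.sh"   # what the pod's start command tries to run
actual_script = CONTAINER_MOUNT + "/run.sh"               # run.sh was uploaded to the NFS root instead
print(expected_script)  # /tmp/mount/nni/ABCDE/run.sh -> "can't open" error
print(actual_script)    # /tmp/mount/run.sh           -> where the file really is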
Thanks, I figured that out too. So you also modified the TS files and built NNI again from source? I did not know about the overwriting. The overwriting of run.sh will not be a problem, right? Because, as I remember, it always creates a new env folder for each trial and runs the code there?
Sorry for the late reply.

"so you also modify the ts files and build nni again from source"

Yes, I modified the TS files and built the NNI wheel from source.

"it always creates a new env folder for each trial and runs the code there"

No, I think it only creates trialConcurrency envs in total and assigns each trial to a free env (roughly round-robin); maybe that's why it's called "reusable"?
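A rough, hypothetical illustration of that reuse behaviour (not NNI's actual scheduler): a fixed pool of trialConcurrency environments, with trials handed to whichever env comes up next:

from itertools import cycle

# Hypothetical sketch: a pool of `trialConcurrency` reusable environments.
# cycle() stands in for NNI's real assignment of trials to free envs.
trial_concurrency = 2
env_pool = [f"env-{i}" for i in range(trial_concurrency)]
assign = cycle(env_pool)

for trial in ["trial-1", "trial-2", "trial-3", "trial-4"]:
    print(trial, "->", next(assign))  # trials reuse the same small env pool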
For the overwriting problem, every env will actually run the same (latest generated) run.sh script, and https://github.com/microsoft/nni/blob/v2.8/nni/tools/trial_tool/trial_runner.py#L164 uses the directory name as the runner id, so it will eventually raise an error.
An example of run.sh:

cd /tmp/mount/nni/5nfd2kzc && mkdir -p envs/ZKtWr && cd envs/ZKtWr && sh ../install_nni.sh && python3 -m nni.tools.trial_tool.trial_runner 1>/tmp/mount/nni/5nfd2kzc/envs/ZKtWr/trialrunner_stdout 2>/tmp/mount/nni/5nfd2kzc/envs/ZKtWr/trialrunner_stderr

Every env will use ZKtWr as its runner_id.
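A minimal sketch of that collision (hypothetical code, not the linked trial_runner.py): because every environment ends up in the same envs/ZKtWr directory, a runner id derived from the directory name is identical for all of them:

import os

# Hypothetical sketch: deriving the runner id from the working-directory name
# makes every env that shares the same overwritten run.sh collide.
env_dirs = [
    "/tmp/mount/nni/5nfd2kzc/envs/ZKtWr",  # env started by pod A
    "/tmp/mount/nni/5nfd2kzc/envs/ZKtWr",  # env started by pod B (same run.sh)
]
runner_ids = [os.path.basename(d) for d in env_dirs]
print(runner_ids)  # ['ZKtWr', 'ZKtWr'] -> duplicate runner ids, which leads to the error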
@amznero At first, I moved the run.sh file to the correct experiment folder, but then the trials didn't seem to run concurrently, and yes, as you say, I also found that the last config is applied to every worker (environment). So my solution is to create a different run.sh file for each env, e.g. run_zktwr.sh, and adjust the start command of each worker accordingly. This gets concurrency working, but the trials then take more time than when there is only 1 worker. Is that also the case for you? If not, could I have your source code? Thank you in advance.
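For reference, a rough sketch of the kind of per-environment script generation described above (hypothetical helper and example paths, mirroring the run.sh shown earlier; not the actual patch):

from pathlib import Path

# Hypothetical sketch: write one run_<env>.sh per environment so that concurrent
# environments do not overwrite each other's entry script.
EXPERIMENT_ROOT = Path("/tmp/mount/nni/5nfd2kzc")  # example experiment path from this thread

def write_env_script(env_id: str) -> Path:
    env_dir = EXPERIMENT_ROOT / "envs" / env_id
    script = EXPERIMENT_ROOT / f"run_{env_id.lower()}.sh"
    script.write_text(
        f"cd {EXPERIMENT_ROOT} && mkdir -p envs/{env_id} && cd envs/{env_id} && "
        f"sh ../install_nni.sh && python3 -m nni.tools.trial_tool.trial_runner "
        f"1>{env_dir}/trialrunner_stdout 2>{env_dir}/trialrunner_stderr\n"
    )
    return script  # point each worker's start command at its own script

# e.g. write_env_script("ZKtWr") -> /tmp/mount/nni/5nfd2kzc/run_zktwr.sh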
@vincenthp2603

"but the trials then take more time than when there's only 1 worker."

Does "time" mean training duration? If so, this didn't happen to me, and I don't think concurrency should affect the training duration.

You can freeze the random seeds (NumPy, torch, cuda, cudnn, et al.) and set worker=1 to record an experiment baseline (batch size, epochs, model parameters, training duration). Then use concurrent mode to train the model and compare it with the baseline. Maybe the training duration is related to model complexity or the training strategy (like a genetic algorithm)?
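For example, a minimal seed-freezing helper along these lines (standard NumPy/PyTorch calls, nothing NNI-specific):

import random
import numpy as np
import torch

def freeze_seeds(seed: int = 42) -> None:
    """Fix the common sources of randomness so baseline and concurrent runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False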
You can see my changes here: https://github.com/microsoft/nni/pull/5045.
NNI v2.9 has been released.