training-operator
Master pod not getting started for PyTorchJob
I'm trying to run the training operator standalone on an OpenShift cluster with Katib. When I apply a PyTorchJob, the worker pods get created, but for some reason the master pod never starts.
Here is the event log of the worker pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m35s default-scheduler Successfully assigned sampler/random-exp-jw6qxmrm-worker-0 to acorvin-hpo-poc-jfrlm-worker-0-twvtz
Normal AddedInterface 9m33s multus Add eth0 [10.131.5.61/23] from openshift-sdn
Normal Pulling 9m33s kubelet Pulling image "quay.io/bharathappali/alpine:3.10"
Normal Pulled 9m32s kubelet Successfully pulled image "quay.io/bharathappali/alpine:3.10" in 1.065165424s (1.065174057s including waiting)
Warning BackOff 2m49s kubelet Back-off restarting failed container init-pytorch in pod random-exp-jw6qxmrm-worker-0_sampler(8d6860a7-204d-45c8-bb57-8d84a6cf8e66)
Normal Created 2m34s (x3 over 9m31s) kubelet Created container init-pytorch
Normal Started 2m34s (x3 over 9m31s) kubelet Started container init-pytorch
Normal Pulled 2m34s (x2 over 6m11s) kubelet Container image "quay.io/bharathappali/alpine:3.10" already present on machine
I have changed the init container image because of the Docker Hub pull rate limit issue.
Here is the pod log:
nslookup: can't resolve 'random-exp-jw6qxmrm-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve
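For context, this output comes from the init-pytorch container that the operator injects into worker replicas: it just loops until the master's DNS name resolves. Roughly along these lines (a hedged sketch of the generated worker pod spec, not the controller's exact template, with the alpine image I substituted):

initContainers:
  - name: init-pytorch
    image: quay.io/bharathappali/alpine:3.10
    command:
      - sh
      - -c
      - |
        # Keep retrying DNS resolution of the master pod's name, then give up.
        err=1
        for i in $(seq 100); do
          if nslookup random-exp-jw6qxmrm-master-0; then
            err=0
            break
          fi
          echo waiting for master
          sleep 2
        done
        exit $err

So the worker CrashLoopBackOff looks like a symptom: the loop eventually exits non-zero because random-exp-jw6qxmrm-master-0 never comes up, and the real question is why the master replica's pod is not created at all.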
Here is the Katib Experiment I'm deploying:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-exp
  namespace: sampler
spec:
  maxTrialCount: 25
  parallelTrialCount: 3
  maxFailedTrialCount: 3
  resumePolicy: Never
  objective:
    type: maximize
    goal: 0.9
    objectiveMetricName: accuracy
    additionalMetricNames: []
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: '0.01'
        max: '0.03'
        step: '0.01'
  metricsCollectorSpec:
    collector:
      kind: StdOut
  trialTemplate:
    primaryContainerName: pytorch
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    retain: false
    trialParameters:
      - name: learningRate
        reference: lr
        description: ''
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "1"
                        memory: "1Gi"
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "1"
                        memory: "1Gi"
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
Is the master not up? random-exp-jw6qxmrm-master-0 doesn't resolve.
Yes, the master pod is not getting scheduled. I see the workers' init container failing and the worker pods going into CrashLoopBackOff.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@bharathappali Sorry for the late reply. Can you try to create your PyTorchJob without the Katib Experiment?
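For example, something like the trialSpec above lifted into a standalone job, with the learning rate hard-coded (a sketch; the job name and the fixed --lr value are placeholders, and resources are omitted for brevity):

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist-standalone
  namespace: sampler
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
              command:
                - python3
                - /opt/pytorch-mnist/mnist.py
                - '--epochs=1'
                - '--lr=0.01'
                - '--momentum=0.5'
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
              command:
                - python3
                - /opt/pytorch-mnist/mnist.py
                - '--epochs=1'
                - '--lr=0.01'
                - '--momentum=0.5'

If the master pod still isn't created for this plain job, the problem is in the training operator itself rather than in the Katib integration.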