PyTorchJob not getting detected by the training operator when launched via Katib
What steps did you take and what happened:
I'm a new Katib user. I installed Katib and the training operator as standalone installations on my local minikube cluster and tried to start an HPO run to tune the hyperparameters of a PyTorchJob (the distributed training example). The experiment does not start because the training operator was not able to detect the kind PyTorchJob. Am I missing any configuration? I was just using the default PyTorch example mentioned in the Katib UI.
Training operator log:
[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl -n kubeflow logs -f training-operator-984cfd546-fqpdk
2024-01-30T08:00:45Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
2024-01-30T08:00:45Z INFO setup starting manager
2024-01-30T08:00:45Z INFO setup registering controllers...
2024-01-30T08:00:45Z INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-01-30T08:00:45Z INFO starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z INFO Starting Controller {"controller": "mxjob-controller"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z INFO Starting Controller {"controller": "xgboostjob-controller"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
2024-01-30T08:00:45Z INFO Starting Controller {"controller": "mpijob-controller"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.PaddleJob"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z INFO Starting Controller {"controller": "paddlejob-controller"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z INFO Starting Controller {"controller": "tfjob-controller"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
2024-01-30T08:00:45Z INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
2024-01-30T08:00:45Z INFO Starting Controller {"controller": "pytorchjob-controller"}
2024-01-30T08:00:45Z INFO Starting workers {"controller": "paddlejob-controller", "worker count": 1}
2024-01-30T08:00:45Z INFO Starting workers {"controller": "xgboostjob-controller", "worker count": 1}
2024-01-30T08:00:45Z INFO Starting workers {"controller": "mpijob-controller", "worker count": 1}
2024-01-30T08:00:45Z INFO Starting workers {"controller": "tfjob-controller", "worker count": 1}
2024-01-30T08:00:45Z INFO Starting workers {"controller": "mxjob-controller", "worker count": 1}
2024-01-30T08:00:45Z INFO Starting workers {"controller": "pytorchjob-controller", "worker count": 1}
Katib controller log:
{"level":"error","ts":"2024-01-30T12:10:32Z","logger":"trial-controller","msg":"Reconcile job error","Trial":{"name":"random-experiment-scpkhp2b","namespace":"kubeflow-user"},"error":"no matches for kind \"PyTorchJob\" in version \"kubeflow.org/v1\"","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/trial.(*ReconcileTrial).reconcileTrial\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:221\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/trial.(*ReconcileTrial).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:180\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226"}
CRD list:
[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl get crds
NAME                       CREATED AT
experiments.kubeflow.org   2024-01-30T07:19:28Z
mpijobs.kubeflow.org       2024-01-30T08:00:30Z
mxjobs.kubeflow.org        2024-01-30T08:00:31Z
paddlejobs.kubeflow.org    2024-01-30T08:00:31Z
pytorchjobs.kubeflow.org   2024-01-30T08:00:31Z
suggestions.kubeflow.org   2024-01-30T07:19:28Z
tfjobs.kubeflow.org        2024-01-30T08:00:31Z
trials.kubeflow.org        2024-01-30T07:19:28Z
xgboostjobs.kubeflow.org   2024-01-30T08:00:31Z
Environment:
- Katib version (check the Katib controller image version): 0.16.0
- Kubernetes version (kubectl version):
[abharath@abharath-thinkpadt14sgen2i gpuXplore]$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.10", GitCommit:"0fa26aea1d5c21516b0d96fea95a77d8d429912e", GitTreeState:"archive", BuildDate:"2024-01-18T00:00:00Z", GoVersion:"go1.21.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:23:26Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.24) exceeds the supported minor version skew of +/-1
- OS (uname -a):
[abharath@abharath-thinkpadt14sgen2i gpuXplore]$ uname -a
Linux abharath-thinkpadt14sgen2i.remote.csb 6.7.3-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 1 03:29:52 UTC 2024 x86_64 GNU/Linux
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
@bharathappali Did you deploy the training-operator first? Also, could you share your Experiment YAML with us?
Thanks for the response @tenzen-y
Did you deploy the training-operator first?
No, I deployed Katib first and the training operator later. Should I install the training operator first?
could you share your Experiment YAML with us?
Yes, I have created the experiment again; here is the YAML from the Katib UI:
metadata:
name: random-experiment
namespace: default
uid: 86f7d5ca-b125-47c1-971b-0e0bc3906e41
resourceVersion: '11624'
generation: 1
creationTimestamp: '2024-02-27T10:37:09Z'
finalizers:
- update-prometheus-metrics
managedFields:
- manager: Go-http-client
operation: Update
apiVersion: kubeflow.org/v1beta1
time: '2024-02-27T10:37:09Z'
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"update-prometheus-metrics": {}
f:spec:
.: {}
f:algorithm:
.: {}
f:algorithmName: {}
f:algorithmSettings: {}
f:maxFailedTrialCount: {}
f:maxTrialCount: {}
f:metricsCollectorSpec:
.: {}
f:collector:
.: {}
f:kind: {}
f:objective:
.: {}
f:additionalMetricNames: {}
f:goal: {}
f:objectiveMetricName: {}
f:type: {}
f:parallelTrialCount: {}
f:parameters: {}
f:resumePolicy: {}
f:trialTemplate:
.: {}
f:failureCondition: {}
f:primaryContainerName: {}
f:successCondition: {}
f:trialParameters: {}
f:trialSpec:
.: {}
f:apiVersion: {}
f:kind: {}
f:spec:
.: {}
f:pytorchReplicaSpecs:
.: {}
f:Master:
.: {}
f:replicas: {}
f:restartPolicy: {}
f:template:
.: {}
f:spec:
.: {}
f:containers: {}
f:Worker:
.: {}
f:replicas: {}
f:restartPolicy: {}
f:template:
.: {}
f:spec:
.: {}
f:containers: {}
- manager: Go-http-client
operation: Update
apiVersion: kubeflow.org/v1beta1
time: '2024-02-27T10:38:00Z'
fieldsType: FieldsV1
fieldsV1:
f:status:
.: {}
f:conditions: {}
f:currentOptimalTrial:
.: {}
f:observation: {}
f:pendingTrialList: {}
f:startTime: {}
f:trials: {}
f:trialsPending: {}
subresource: status
spec:
parameters:
- name: lr
parameterType: double
feasibleSpace:
max: '0.03'
min: '0.01'
step: '0.01'
objective:
type: maximize
goal: 0.99
objectiveMetricName: Validation-accuracy
additionalMetricNames:
- Train-accuracy
metricStrategies:
- name: Validation-accuracy
value: max
- name: Train-accuracy
value: max
algorithm:
algorithmName: bayesianoptimization
algorithmSettings:
- name: base_estimator
value: GP
- name: n_initial_points
value: '10'
- name: acq_func
value: gp_hedge
- name: acq_optimizer
value: auto
trialTemplate:
trialSpec:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- command:
- python3
- /opt/pytorch-mnist/mnist.py
- '--epochs=1'
- '--lr=${trialParameters.learningRate}'
- '--momentum=0.5'
image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
name: pytorch
Worker:
replicas: 2
restartPolicy: OnFailure
template:
spec:
containers:
- command:
- python3
- /opt/pytorch-mnist/mnist.py
- '--epochs=1'
- '--lr=${trialParameters.learningRate}'
- '--momentum=0.5'
image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
name: pytorch
trialParameters:
- name: learningRate
reference: lr
primaryPodLabels:
training.kubeflow.org/job-role: master
primaryContainerName: training-container
successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
parallelTrialCount: 3
maxTrialCount: 12
maxFailedTrialCount: 3
metricsCollectorSpec:
collector:
kind: StdOut
resumePolicy: Never
status:
startTime: '2024-02-27T10:37:09Z'
conditions:
- type: Created
status: 'True'
reason: ExperimentCreated
message: Experiment is created
lastUpdateTime: '2024-02-27T10:37:09Z'
lastTransitionTime: '2024-02-27T10:37:09Z'
- type: Running
status: 'True'
reason: ExperimentRunning
message: Experiment is running
lastUpdateTime: '2024-02-27T10:38:00Z'
lastTransitionTime: '2024-02-27T10:38:00Z'
currentOptimalTrial:
observation: {}
pendingTrialList:
- random-experiment-65rvn6c7
- random-experiment-lhdx4rjr
- random-experiment-8gtk2z74
trials: 3
trialsPending: 3
No, I deployed Katib first and the training operator later. Should I install the training operator first?
Oh, I see. We must deploy the training-operator first. If Katib was deployed first, we need to restart the Katib controller Pod after deploying the training-operator. Could you confirm this on your local cluster?
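One way to do that restart (a minimal sketch, assuming the standalone install's default kubeflow namespace and katib-controller deployment name):

# Restart the Katib controller Pod so it re-runs API discovery
# and picks up the newly registered PyTorchJob kind
kubectl -n kubeflow rollout restart deployment/katib-controller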
Thanks, I restarted Katib and tried to run the PyTorch job (distributed training example), but I ran into problems when I created an experiment with this YAML:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
name: random-experiment
namespace: ''
spec:
maxTrialCount: 12
parallelTrialCount: 3
maxFailedTrialCount: 3
resumePolicy: Never
objective:
type: maximize
goal: 0.9
objectiveMetricName: accuracy
additionalMetricNames: []
algorithm:
algorithmName: bayesianoptimization
algorithmSettings:
- name: base_estimator
value: GP
- name: n_initial_points
value: '10'
- name: acq_func
value: gp_hedge
- name: acq_optimizer
value: auto
parameters:
- name: lr
parameterType: double
feasibleSpace:
min: '0.01'
max: '0.03'
step: '0.01'
metricsCollectorSpec:
collector:
kind: StdOut
trialTemplate:
primaryContainerName: training-container
successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
retain: false
trialParameters:
- name: learningRate
reference: lr
description: ''
trialSpec:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
command:
- python3
- /opt/pytorch-mnist/mnist.py
- '--epochs=1'
- '--lr=${trialParameters.learningRate}'
- '--momentum=0.5'
Worker:
replicas: 2
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0
command:
- python3
- /opt/pytorch-mnist/mnist.py
- '--epochs=1'
- '--lr=${trialParameters.learningRate}'
- '--momentum=0.5'
This is the UI-generated YAML, and the training operator was unable to find training-container, so I changed the container names to training-container and got this error:
2024-02-27T11:10:40Z ERROR PyTorchJob failed validation {"pytorchjob": {"name":"random-experiment-bjl5fqvl","namespace":"default"}, "error": "PyTorchJobSpec is not valid: There is no container named pytorch in Master"}
2024-02-27T11:10:40Z ERROR Reconciler error {"controller": "pytorchjob-controller", "object": {"name":"random-experiment-bjl5fqvl","namespace":"default"}, "namespace": "default", "name": "random-experiment-bjl5fqvl", "reconcileID": "2fb2991b-59fb-4d83-85ba-30c1765e7978", "error": "PyTorchJobSpec is not valid: There is no container named pytorch in Master"}
Later I changed the primary container name to pytorch, as required by the PyTorchJobSpec, and I can see the pods getting created:
[abharath@abharath-thinkpadt14sgen2i ~]$ kubectl get pods
NAME                                                      READY   STATUS              RESTARTS      AGE
random-experiment-bayesianoptimization-5749c87757-nzg2l   1/1     Running             0             4m24s
random-experiment-kf8zd5hs-master-0                       0/2     ContainerCreating   0             4m2s
random-experiment-kf8zd5hs-worker-0                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-kf8zd5hs-worker-1                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-nfhzbkq4-master-0                       0/2     ContainerCreating   0             4m3s
random-experiment-nfhzbkq4-worker-0                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-nfhzbkq4-worker-1                       0/1     Init:0/1            1 (37s ago)   4m2s
random-experiment-pvwccdnr-master-0                       0/2     ContainerCreating   0             4m3s
random-experiment-pvwccdnr-worker-0                       0/1     Init:0/1            1 (41s ago)   4m3s
random-experiment-pvwccdnr-worker-1                       0/1     Init:0/1            1 (41s ago)   4m3s
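For clarity, the fields that had to agree look roughly like this (a sketch distilled from the discussion above, not the full manifest): the PyTorchJob validation requires the container to be named pytorch, and Katib's primaryContainerName must point at that same name.

trialTemplate:
  primaryContainerName: pytorch        # must match the container name below
  trialSpec:
    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    spec:
      pytorchReplicaSpecs:
        Master:
          template:
            spec:
              containers:
                - name: pytorch        # PyTorchJob requires this exact name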
Thanks @tenzen-y
Thank you for creating this @bharathappali.
That's correct, you have to name the container pytorch for PyTorchJob. Did your Katib Trials succeed after you renamed the container and primaryContainerName?
Yes @andreyvelich, I was able to run Katib trials after the changes.
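For anyone following along, trial progress can be checked with commands like these (a hedged example, assuming the default namespace used in the YAML above):

# Overall experiment status
kubectl -n default get experiment random-experiment

# Individual trial results
kubectl -n default get trials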
Thank you! Closing this issue.