katib
katib copied to clipboard
Trial job is succeeded but metrics are not reported, reconcile requeued
/kind bug
What steps did you take and what happened: I just tried to run the random experiment example, through the Katib UI (I also tried creating an experiment using python, but the same error occurs).
Following the experiment creation with the UI, I only changed the trial template (YAML), with this:
apiVersion: batch/v1
kind: Job
metadata:
annotations:
sidecar.istio.io/inject: "false"
katib-metricscollector-injection: enabled
katib-metrics-collector-injection: enabled
spec:
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
katib-metricscollector-injection: enabled
katib-metrics-collector-injection: enabled
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:latest
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
- "--batch-size=64"
- "--lr=${trialParameters.learningRate}"
- "--num-layers=${trialParameters.numberLayers}"
- "--optimizer=${trialParameters.optimizer}"
restartPolicy: Never
After a couple of minutes, the pods created by the job terminated, with the status Completed, and printed my objective metric as this:
2022-01-25T20:26:59Z INFO Epoch[9] Train-accuracy=0.993770
2022-01-25T20:26:59Z INFO Epoch[9] Time cost=5.344
2022-01-25T20:26:59Z INFO Epoch[9] Validation-accuracy=0.978802
But the experiment, suggestions, and trials keep with status Running, and new trials are not created.
When I check the katib-controller logs, I get the following msg:
{"level":"info","ts":1643142603.5533006,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}
{"level":"info","ts":1643142603.633143,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}
{"level":"info","ts":1643142603.655875,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-smw6p6rg"}
Additional Information:
kubectl get experiment random-experiment -o yaml -n kubeflow-user-example-com
Results in:
Output
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
creationTimestamp: "2022-01-25T20:25:22Z"
finalizers:
- update-prometheus-metrics
generation: 1
name: random-experiment
namespace: kubeflow-user-example-com
resourceVersion: "126860285"
uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
spec:
algorithm:
algorithmName: random
maxFailedTrialCount: 3
maxTrialCount: 12
metricsCollectorSpec:
collector:
kind: StdOut
objective:
additionalMetricNames:
- Train-accuracy
goal: 0.05
objectiveMetricName: Validation-accuracy
type: maximize
parallelTrialCount: 3
parameters:
- feasibleSpace:
max: "0.03"
min: "0.01"
step: "0.01"
name: lr
parameterType: double
- feasibleSpace:
max: "64"
min: "1"
step: "1"
name: num-layers
parameterType: int
- feasibleSpace:
list:
- sgd
- adam
- ftrl
name: optimizer
parameterType: categorical
resumePolicy: LongRunning
trialTemplate:
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
primaryContainerName: training-container
successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
trialParameters:
- name: learningRate
reference: lr
- name: numberLayers
reference: num-layers
- name: optimizer
reference: optimizer
trialSpec:
apiVersion: batch/v1
kind: Job
metadata:
annotations:
katib-metrics-collector-injection: enabled
katib-metricscollector-injection: enabled
sidecar.istio.io/inject: "false"
labels:
katib-metrics-collector-injection: enabled
katib-metricscollector-injection: enabled
sidecar.istio.io/inject: "false"
spec:
template:
metadata:
annotations:
katib-metrics-collector-injection: enabled
katib-metricscollector-injection: enabled
sidecar.istio.io/inject: "false"
labels:
katib-metrics-collector-injection: enabled
katib-metricscollector-injection: enabled
sidecar.istio.io/inject: "false"
spec:
containers:
- command:
- python3
- /opt/mxnet-mnist/mnist.py
- --batch-size=64
- --lr=${trialParameters.learningRate}
- --num-layers=${trialParameters.numberLayers}
- --optimizer=${trialParameters.optimizer}
image: docker.io/kubeflowkatib/mxnet-mnist:latest
name: training-container
restartPolicy: Never
status:
conditions:
- lastTransitionTime: "2022-01-25T20:25:22Z"
lastUpdateTime: "2022-01-25T20:25:22Z"
message: Experiment is created
reason: ExperimentCreated
status: "True"
type: Created
- lastTransitionTime: "2022-01-25T20:25:44Z"
lastUpdateTime: "2022-01-25T20:25:44Z"
message: Experiment is running
reason: ExperimentRunning
status: "True"
type: Running
currentOptimalTrial:
observation: {}
runningTrialList:
- random-experiment-smw6p6rg
- random-experiment-c9qr67ww
- random-experiment-vzkjcznm
startTime: "2022-01-25T20:25:22Z"
trials: 3
trialsRunning: 3
and
kubectl get trial random-experiment-c9qr67ww -n kubeflow-user-example-com -o yaml
Results in:
Output
apiVersion: kubeflow.org/v1beta1
kind: Trial
metadata:
creationTimestamp: "2022-01-25T20:25:44Z"
finalizers:
- clean-metrics-in-db
generation: 1
labels:
katib.kubeflow.org/experiment: random-experiment
name: random-experiment-c9qr67ww
namespace: kubeflow-user-example-com
ownerReferences:
- apiVersion: kubeflow.org/v1beta1
blockOwnerDeletion: true
controller: true
kind: Experiment
name: random-experiment
uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
resourceVersion: "126860266"
uid: 24a7d825-2737-4d6f-8ba8-5e22d776443f
spec:
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
metricsCollector:
collector:
kind: StdOut
objective:
additionalMetricNames:
- Train-accuracy
goal: 0.05
objectiveMetricName: Validation-accuracy
type: maximize
parameterAssignments:
- name: lr
value: "0.018768621111940782"
- name: num-layers
value: "7"
- name: optimizer
value: sgd
primaryContainerName: training-container
runSpec:
apiVersion: batch/v1
kind: Job
metadata:
annotations:
katib-metrics-collector-injection: enabled
katib-metricscollector-injection: enabled
sidecar.istio.io/inject: "false"
labels:
katib-metrics-collector-injection: enabled
katib-metricscollector-injection: enabled
sidecar.istio.io/inject: "false"
name: random-experiment-c9qr67ww
namespace: kubeflow-user-example-com
spec:
template:
metadata:
annotations:
katib-metrics-collector-injection: enabled
katib-metricscollector-injection: enabled
sidecar.istio.io/inject: "false"
labels:
katib-metrics-collector-injection: enabled
katib-metricscollector-injection: enabled
sidecar.istio.io/inject: "false"
spec:
containers:
- command:
- python3
- /opt/mxnet-mnist/mnist.py
- --batch-size=64
- --lr=0.018768621111940782
- --num-layers=7
- --optimizer=sgd
image: docker.io/kubeflowkatib/mxnet-mnist:latest
name: training-container
restartPolicy: Never
successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
status:
conditions:
- lastTransitionTime: "2022-01-25T20:25:44Z"
lastUpdateTime: "2022-01-25T20:25:44Z"
message: Trial is created
reason: TrialCreated
status: "True"
type: Created
- lastTransitionTime: "2022-01-25T20:25:44Z"
lastUpdateTime: "2022-01-25T20:25:44Z"
message: Trial is running
reason: TrialRunning
status: "True"
type: Running
startTime: "2022-01-25T20:25:44Z"
What did you expect to happen: Ideally, once the metrics are captured and the goal/maxTrial is reached, the trial status should change to succeeded.
What am I missing?
Thanks
I'm suspicious about any other sidecar was injected or the katib-mysql pod was not healthy.
-
Could you print the result of
kubectl get pod -o wide -n kubeflow-user-example-com
when the pod was running and completed ? -
Could you print the status and logs of
katib-mysql
inkubeflow
namespace ?
This seems to be the same issue as multiple people. Here is my output from your request as I have the same results as above.
Are these issues all the same problem? https://github.com/kubeflow/katib/issues/1837 https://github.com/kubeflow/katib/issues/1836 https://github.com/kubeflow/katib/issues/1830
![Screen Shot 2022-03-22 at 1 02 25 PM](https://user-images.githubusercontent.com/98784768/159566380-6f951576-41d6-4753-906d-ee61293187ca.png)
![Screen Shot 2022-03-22 at 1 07 51 PM](https://user-images.githubusercontent.com/98784768/159566974-8cc85566-2f98-4fdf-a3ad-fa15bae3099e.png)
![Screen Shot 2022-03-22 at 1 09 03 PM](https://user-images.githubusercontent.com/98784768/159567047-6f4bac2c-fc63-4f08-b389-75f464c91b69.png)
kubectl -n kubeflow logs katib-controller-75d4b6db74-62qml {"level":"info","ts":1647978455.639091,"logger":"entrypoint","msg":"Config:","experiment-suggestion-name":"default","webhook-port":8443,"metrics-addr":":8080","inject-security-context":false,"enable-grpc-probe-in-suggestion":true,"trial-resources":[{"Group":"batch","Version":"v1","Kind":"Job"},{"Group":"kubeflow.org","Version":"v1","Kind":"TFJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"PyTorchJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"MPIJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"XGBoostJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"MXJob"}]} I0322 19:47:36.698282 1 request.go:655] Throttling request took 1.03357327s, request: GET:https://10.100.0.1:443/apis/dex.coreos.com/v1?timeout=32s {"level":"info","ts":1647978457.6716034,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"} {"level":"info","ts":1647978457.671948,"logger":"entrypoint","msg":"Registering Components."} {"level":"info","ts":1647978457.6743279,"logger":"entrypoint","msg":"Setting up controller."} {"level":"info","ts":1647978457.674349,"logger":"experiment-controller","msg":"Using the default suggestion implementation"} {"level":"info","ts":1647978457.6744256,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6744478,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6744595,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6744664,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674474,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674481,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674485,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6744885,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674493,"logger":"experiment-controller","msg":"Experiment controller created"} {"level":"info","ts":1647978457.674517,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745222,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745288,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745324,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745372,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745434,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745472,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745508,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745555,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745596,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745625,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674566,"logger":"suggestion-controller","msg":"Suggestion controller created"} {"level":"info","ts":1647978457.6746092,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6746144,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674662,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6746662,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6746714,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6746764,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"batch","CRD Version":"v1","CRD Kind":"Job"} {"level":"info","ts":1647978457.6747062,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747098,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747131,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747172,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"TFJob"} {"level":"info","ts":1647978457.6747453,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747487,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674752,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747553,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"PyTorchJob"} {"level":"info","ts":1647978457.6747775,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747808,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674784,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747878,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"MPIJob"} {"level":"info","ts":1647978457.6748126,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6748164,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6748192,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6748228,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"XGBoostJob"} {"level":"info","ts":1647978457.6748462,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.67485,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6748536,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674857,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"MXJob"} {"level":"info","ts":1647978457.6748607,"logger":"trial-controller","msg":"Trial controller created"} {"level":"info","ts":1647978457.674864,"logger":"entrypoint","msg":"Setting up webhooks."} {"level":"info","ts":1647978457.6748724,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674973,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-experiment"} {"level":"info","ts":1647978457.6749876,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6749978,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6750028,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6750445,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6750941,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-experiment"} {"level":"info","ts":1647978457.675098,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751046,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.675108,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751313,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751745,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-pod"} {"level":"info","ts":1647978457.6751783,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751873,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751907,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.675213,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6752174,"logger":"entrypoint","msg":"Starting the Cmd."} {"level":"info","ts":1647978457.6864378,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"} {"level":"info","ts":1647978457.6866453,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"} {"level":"info","ts":1647978457.686703,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":8443} {"level":"info","ts":1647978457.6867871,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.6869707,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.6870694,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.6983204,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"} {"level":"info","ts":1647978457.6991751,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"} {"level":"info","ts":1647978457.7876654,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: batch/v1, Kind=Job"} {"level":"info","ts":1647978457.78791,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.7879503,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.7879674,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting Controller"} {"level":"info","ts":1647978457.7880077,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.8883045,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.8885317,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=TFJob"} {"level":"info","ts":1647978457.9890933,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting workers","worker count":1} {"level":"info","ts":1647978457.9897153,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=PyTorchJob"} {"level":"info","ts":1647978457.9901621,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978458.0914237,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=MPIJob"} {"level":"info","ts":1647978458.0920405,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting Controller"} {"level":"info","ts":1647978458.1927216,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=XGBoostJob"} {"level":"info","ts":1647978458.293133,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=MXJob"} {"level":"info","ts":1647978458.2938151,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting workers","worker count":1} {"level":"info","ts":1647978458.3940845,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting Controller"} {"level":"info","ts":1647978458.5143259,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting workers","worker count":1} {"level":"info","ts":1647978458.5252428,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow-user-example-com/random2","Suggestion Requests":3,"Suggestion Count":3} {"level":"info","ts":1647978458.5519454,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow-user-example-com/pytorchjob-mnist","Suggestion Requests":3,"Suggestion Count":3} 2022/03/22 19:47:52 http: TLS handshake error from 192.168.101.32:51638: remote error: tls: bad certificate {"level":"info","ts":1647978486.5489805,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/random","Suggestion Requests":3,"Suggestion Count":3} {"level":"info","ts":1647978486.5796025,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/random","Suggestion Requests":3,"Suggestion Count":3}
This is typically a Katib deployment or Experiment config issue.
Adding steps to debug for future ref
An experiment spawns multiple trials. For each Trial, a worker Job is created from the TrialSpec of the Experiment Yaml. See https://github.com/kubeflow/katib/blob/master/examples/v1beta1/hp-tuning/random.yaml#L49. When worker Job is created, a metrics collector sidecar is created. Once worker Job is succeeded, metrics collector sidecar collects metrics from the main container and submits to the Katib DB Manager. If latter cannot be completed. Trial will not be successful even though underlying worker is successful.
Check Experiment status and Trial status
Does it have MetricUnavailable Trials? If Yes, then check the following
- Is Katib DB manager deployment up and Running? If not, underlying database(mysql as default) is not ready
- Check logs of metrics collector side car to see if it is able to send metric to Katib DB manager. If not, either metrics cannot be collected(wrong log format) or cannot be sent to DB manager(network settings issue)
/cc @tenzen-y
As described by @johnugeorge, katib-controller adds MetricUnavailable
status to your Experiment and Trial resources in the following cases:
- katib-controller failed to inject metrics-collector as a sidecar container.
- metrics-collector can not collect any metrics or failed to report katib-db-manager.
If you faced the first case, please check network policies or istio sidecar settings and so on. If you faced the second case, please check even if your training code output appropriate format logs. If you are having trouble about setting regular expressions, I would suggest json-format-logs.
example for json-format-logs: https://github.com/kubeflow/katib/blob/478e01d612c0d1c05c7fceea29edbea922b30b9e/examples/v1beta1/metrics-collector/file-metrics-collector-with-json-format.yaml
Also, MetricUnavailable
status was introduced in katib v0.14.0-rc.0, so if you are using an older version, please update it.
@johnugeorge @tenzen-y thanks for the explanation. In my case, adding metricsCollectorSpec:
to the grid example solved the issue. I am able to click the Trials/YAML/Details tabs in the Experiment view in the Web UI, and though I am still waiting for the trials to finish, I am positive they will succeed eventually. I have also increased the number of epochs to allow the script to reach an optimal.
Is this a valid solution? If so, I can gladly update some of the examples in kubeflow/katib that are missing the metrics spec.
@DnPlas That is bit strange. If metric collector is not specified, it is defaulted to StdOutCollector. https://github.com/kubeflow/katib/blob/3906f8e8e00b5923be05f5b9a07442baa0ada2fa/pkg/apis/controller/experiments/v1beta1/experiment_defaults.go#L131
Are your webhooks working correctly? https://github.com/kubeflow/katib/blob/3906f8e8e00b5923be05f5b9a07442baa0ada2fa/pkg/webhook/v1beta1/webhook.go#L45
@johnugeorge My webhook is in the cluster, though I am not sure if there is a misconfiguration.
Once you apply grid example, can you verify if metricsCollectorSpec is populated automatically with `kubectl get experiment ? If not, you need to verify why webhooks are not working in your k8s cluster. Are webhooks working for non-katib components?
@johnugeorge my trials' metricsCollector
is empty, I will check the webhook and try to see what might gone wrong.
spec:
metricsCollector: {}
I have the same situation as @dnplas, and the Experiment's metricsCollectorSpec
is absent (no key), so I suspect the webhook handling the experiment is misbehaving but I'm not sure how
While debugging this for my case, I noticed the katib mutatingwebhooks are set to failurePolicy: Ignore
. Setting that to Fail
gave me some better hints to go on:
kubectl create -f custom-metrics-collector.yaml [4:47:19]
Error from server (InternalError): error when creating "custom-metrics-collector.yaml": Internal error occurred: failed calling webhook "defaulter.experiment.katib.kubeflow.org": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
So it appears I have some other configuration problems around my certs to dig into
I was able to fix my issue, though I discovered mine was different than @ccastro-pedro, as I was not getting the sidecar metrics container injected in the trials pods due to some missing webhook certs.
In order to further debug (in case anyone runs into similar issues), I suggest to:
- Check the above comments from @johnugeorge and @tenzen-y
- Check the validation and webhook certificates
- Change the
failurePolicy
toFail
, it helps to catch errors in the early stages and prevents things to fail silently
- Change the
- Verify that the metrics container can communicate with the katib-db
@charlescurt your issue looks similar to what I was experiencing. In my case, I had to ensure the caBundle
was correctly populated in the validation and mutating webhooks. Check the certificates at /tmp/cert
in the webhook server and make sure you can make POST requests to it using the certs in your webhooks.
Dear @DnPlas :
In order to further debug (in case anyone runs into similar issues), I suggest to:
Check the above comments from @johnugeorge and @tenzen-y
Check the validation and webhook certificates
- Change the
failurePolicy
toFail
, it helps to catch errors in the early stages and prevents things to fail silently
Could you please indicate what do you refer exactly with that? In which pod should I change the failurePolicy to fail? I have not found the pod in which I can do it.
- Verify that the metrics container can communicate with the katib-db
Could you specify a bit more how I can verify that? Thank you in advance.
@charlescurt your issue looks similar to what I was experiencing. In my case, I had to ensure the
caBundle
was correctly populated in the validation and mutating webhooks. Check the certificates at/tmp/cert
in the webhook server and make sure you can make POST requests to it using the certs in your webhooks.
@joaquingarciaatos After facing this same issue, I eventually realized the problem was that while the webhook service listens on port 443, the webhook pod listens on port 8443 and network traffic to 8443 was blocked in our infrastructure setup. Opening port 8443 on our system nodes solved the problem.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In the latest Katib version, failurePolicy: Fail
by default: https://github.com/kubeflow/katib/blob/master/CHANGELOG.md#v0160-rc0-2023-08-05
So the Webhook errors can be caught easily.
Feel free to open a new issue, if you see any other problems.