katib icon indicating copy to clipboard operation
katib copied to clipboard

Trial job is succeeded but metrics are not reported, reconcile requeued

Open ccastro-pedro opened this issue 2 years ago • 14 comments

/kind bug

What steps did you take and what happened: I just tried to run the random experiment example, through the Katib UI (I also tried creating an experiment using python, but the same error occurs).

Following the experiment creation with the UI, I only changed the trial template (YAML), with this:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
    katib-metricscollector-injection: enabled
    katib-metrics-collector-injection: enabled
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
        katib-metricscollector-injection: enabled
        katib-metrics-collector-injection: enabled     
    spec: 
      containers:
        - name: training-container
          image: docker.io/kubeflowkatib/mxnet-mnist:latest
          command:
            - "python3"
            - "/opt/mxnet-mnist/mnist.py"
            - "--batch-size=64"
            - "--lr=${trialParameters.learningRate}"
            - "--num-layers=${trialParameters.numberLayers}"
            - "--optimizer=${trialParameters.optimizer}"
      restartPolicy: Never

After a couple of minutes, the pods created by the job terminated, with the status Completed, and printed my objective metric as this:

2022-01-25T20:26:59Z INFO     Epoch[9] Train-accuracy=0.993770
2022-01-25T20:26:59Z INFO     Epoch[9] Time cost=5.344
2022-01-25T20:26:59Z INFO     Epoch[9] Validation-accuracy=0.978802

But the experiment, suggestions, and trials keep with status Running, and new trials are not created.

When I check the katib-controller logs, I get the following msg:

{"level":"info","ts":1643142603.5533006,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}
{"level":"info","ts":1643142603.633143,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}
{"level":"info","ts":1643142603.655875,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-smw6p6rg"}

Additional Information:

kubectl get experiment random-experiment -o yaml -n kubeflow-user-example-com

Results in:

Output

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2022-01-25T20:25:22Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: random-experiment
  namespace: kubeflow-user-example-com
  resourceVersion: "126860285"
  uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
      step: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "64"
      min: "1"
      step: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - name: learningRate
      reference: lr
    - name: numberLayers
      reference: num-layers
    - name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      metadata:
        annotations:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
        labels:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
      spec:
        template:
          metadata:
            annotations:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
            labels:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=64
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:latest
              name: training-container
            restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:22Z"
    lastUpdateTime: "2022-01-25T20:25:22Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation: {}
  runningTrialList:
  - random-experiment-smw6p6rg
  - random-experiment-c9qr67ww
  - random-experiment-vzkjcznm
  startTime: "2022-01-25T20:25:22Z"
  trials: 3
  trialsRunning: 3

and

kubectl get trial random-experiment-c9qr67ww -n  kubeflow-user-example-com  -o yaml 

Results in:

Output

apiVersion: kubeflow.org/v1beta1
kind: Trial
metadata:
  creationTimestamp: "2022-01-25T20:25:44Z"
  finalizers:
  - clean-metrics-in-db
  generation: 1
  labels:
    katib.kubeflow.org/experiment: random-experiment
  name: random-experiment-c9qr67ww
  namespace: kubeflow-user-example-com
  ownerReferences:
  - apiVersion: kubeflow.org/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-experiment
    uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
  resourceVersion: "126860266"
  uid: 24a7d825-2737-4d6f-8ba8-5e22d776443f
spec:
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  metricsCollector:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameterAssignments:
  - name: lr
    value: "0.018768621111940782"
  - name: num-layers
    value: "7"
  - name: optimizer
    value: sgd
  primaryContainerName: training-container
  runSpec:
    apiVersion: batch/v1
    kind: Job
    metadata:
      annotations:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      labels:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      name: random-experiment-c9qr67ww
      namespace: kubeflow-user-example-com
    spec:
      template:
        metadata:
          annotations:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
          labels:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python3
            - /opt/mxnet-mnist/mnist.py
            - --batch-size=64
            - --lr=0.018768621111940782
            - --num-layers=7
            - --optimizer=sgd
            image: docker.io/kubeflowkatib/mxnet-mnist:latest
            name: training-container
          restartPolicy: Never
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is running
    reason: TrialRunning
    status: "True"
    type: Running
  startTime: "2022-01-25T20:25:44Z"

What did you expect to happen: Ideally, once the metrics are captured and the goal/maxTrial is reached, the trial status should change to succeeded.

What am I missing?

Thanks

ccastro-pedro avatar Jan 25 '22 20:01 ccastro-pedro

I'm suspicious about any other sidecar was injected or the katib-mysql pod was not healthy.

  1. Could you print the result of kubectl get pod -o wide -n kubeflow-user-example-com when the pod was running and completed ?

  2. Could you print the status and logs of katib-mysql in kubeflow namespace ?

anencore94 avatar Feb 02 '22 14:02 anencore94

This seems to be the same issue as multiple people. Here is my output from your request as I have the same results as above.

Are these issues all the same problem? https://github.com/kubeflow/katib/issues/1837 https://github.com/kubeflow/katib/issues/1836 https://github.com/kubeflow/katib/issues/1830

Screen Shot 2022-03-22 at 1 02 25 PM Screen Shot 2022-03-22 at 1 07 51 PM Screen Shot 2022-03-22 at 1 09 03 PM

kubectl -n kubeflow logs katib-controller-75d4b6db74-62qml {"level":"info","ts":1647978455.639091,"logger":"entrypoint","msg":"Config:","experiment-suggestion-name":"default","webhook-port":8443,"metrics-addr":":8080","inject-security-context":false,"enable-grpc-probe-in-suggestion":true,"trial-resources":[{"Group":"batch","Version":"v1","Kind":"Job"},{"Group":"kubeflow.org","Version":"v1","Kind":"TFJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"PyTorchJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"MPIJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"XGBoostJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"MXJob"}]} I0322 19:47:36.698282 1 request.go:655] Throttling request took 1.03357327s, request: GET:https://10.100.0.1:443/apis/dex.coreos.com/v1?timeout=32s {"level":"info","ts":1647978457.6716034,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"} {"level":"info","ts":1647978457.671948,"logger":"entrypoint","msg":"Registering Components."} {"level":"info","ts":1647978457.6743279,"logger":"entrypoint","msg":"Setting up controller."} {"level":"info","ts":1647978457.674349,"logger":"experiment-controller","msg":"Using the default suggestion implementation"} {"level":"info","ts":1647978457.6744256,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6744478,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6744595,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6744664,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674474,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674481,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674485,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6744885,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674493,"logger":"experiment-controller","msg":"Experiment controller created"} {"level":"info","ts":1647978457.674517,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745222,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745288,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745324,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745372,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745434,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745472,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745508,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745555,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745596,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6745625,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674566,"logger":"suggestion-controller","msg":"Suggestion controller created"} {"level":"info","ts":1647978457.6746092,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6746144,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674662,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6746662,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6746714,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6746764,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"batch","CRD Version":"v1","CRD Kind":"Job"} {"level":"info","ts":1647978457.6747062,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747098,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747131,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747172,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"TFJob"} {"level":"info","ts":1647978457.6747453,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747487,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674752,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747553,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"PyTorchJob"} {"level":"info","ts":1647978457.6747775,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747808,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674784,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6747878,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"MPIJob"} {"level":"info","ts":1647978457.6748126,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6748164,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6748192,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6748228,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"XGBoostJob"} {"level":"info","ts":1647978457.6748462,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.67485,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6748536,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674857,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"MXJob"} {"level":"info","ts":1647978457.6748607,"logger":"trial-controller","msg":"Trial controller created"} {"level":"info","ts":1647978457.674864,"logger":"entrypoint","msg":"Setting up webhooks."} {"level":"info","ts":1647978457.6748724,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.674973,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-experiment"} {"level":"info","ts":1647978457.6749876,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6749978,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6750028,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6750445,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6750941,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-experiment"} {"level":"info","ts":1647978457.675098,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751046,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.675108,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751313,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751745,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-pod"} {"level":"info","ts":1647978457.6751783,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751873,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6751907,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.675213,"logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"} {"level":"info","ts":1647978457.6752174,"logger":"entrypoint","msg":"Starting the Cmd."} {"level":"info","ts":1647978457.6864378,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"} {"level":"info","ts":1647978457.6866453,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"} {"level":"info","ts":1647978457.686703,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":8443} {"level":"info","ts":1647978457.6867871,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.6869707,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.6870694,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.6983204,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"} {"level":"info","ts":1647978457.6991751,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"} {"level":"info","ts":1647978457.7876654,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: batch/v1, Kind=Job"} {"level":"info","ts":1647978457.78791,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.7879503,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.7879674,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting Controller"} {"level":"info","ts":1647978457.7880077,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.8883045,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978457.8885317,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=TFJob"} {"level":"info","ts":1647978457.9890933,"logger":"controller-runtime.manager.controller.experiment-controller","msg":"Starting workers","worker count":1} {"level":"info","ts":1647978457.9897153,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=PyTorchJob"} {"level":"info","ts":1647978457.9901621,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting EventSource","source":"kind source: /, Kind="} {"level":"info","ts":1647978458.0914237,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=MPIJob"} {"level":"info","ts":1647978458.0920405,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting Controller"} {"level":"info","ts":1647978458.1927216,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=XGBoostJob"} {"level":"info","ts":1647978458.293133,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting EventSource","source":"kind source: kubeflow.org/v1, Kind=MXJob"} {"level":"info","ts":1647978458.2938151,"logger":"controller-runtime.manager.controller.suggestion-controller","msg":"Starting workers","worker count":1} {"level":"info","ts":1647978458.3940845,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting Controller"} {"level":"info","ts":1647978458.5143259,"logger":"controller-runtime.manager.controller.trial-controller","msg":"Starting workers","worker count":1} {"level":"info","ts":1647978458.5252428,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow-user-example-com/random2","Suggestion Requests":3,"Suggestion Count":3} {"level":"info","ts":1647978458.5519454,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow-user-example-com/pytorchjob-mnist","Suggestion Requests":3,"Suggestion Count":3} 2022/03/22 19:47:52 http: TLS handshake error from 192.168.101.32:51638: remote error: tls: bad certificate {"level":"info","ts":1647978486.5489805,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/random","Suggestion Requests":3,"Suggestion Count":3} {"level":"info","ts":1647978486.5796025,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/random","Suggestion Requests":3,"Suggestion Count":3}

charlescurt avatar Mar 22 '22 20:03 charlescurt

This is typically a Katib deployment or Experiment config issue.

Adding steps to debug for future ref

An experiment spawns multiple trials. For each Trial, a worker Job is created from the TrialSpec of the Experiment Yaml. See https://github.com/kubeflow/katib/blob/master/examples/v1beta1/hp-tuning/random.yaml#L49. When worker Job is created, a metrics collector sidecar is created. Once worker Job is succeeded, metrics collector sidecar collects metrics from the main container and submits to the Katib DB Manager. If latter cannot be completed. Trial will not be successful even though underlying worker is successful.

Check Experiment status and Trial status

Does it have MetricUnavailable Trials? If Yes, then check the following

  1. Is Katib DB manager deployment up and Running? If not, underlying database(mysql as default) is not ready
  2. Check logs of metrics collector side car to see if it is able to send metric to Katib DB manager. If not, either metrics cannot be collected(wrong log format) or cannot be sent to DB manager(network settings issue)

/cc @tenzen-y

johnugeorge avatar Aug 17 '22 06:08 johnugeorge

As described by @johnugeorge, katib-controller adds MetricUnavailable status to your Experiment and Trial resources in the following cases:

  1. katib-controller failed to inject metrics-collector as a sidecar container.
  2. metrics-collector can not collect any metrics or failed to report katib-db-manager.

If you faced the first case, please check network policies or istio sidecar settings and so on. If you faced the second case, please check even if your training code output appropriate format logs. If you are having trouble about setting regular expressions, I would suggest json-format-logs.

example for json-format-logs: https://github.com/kubeflow/katib/blob/478e01d612c0d1c05c7fceea29edbea922b30b9e/examples/v1beta1/metrics-collector/file-metrics-collector-with-json-format.yaml

Also, MetricUnavailable status was introduced in katib v0.14.0-rc.0, so if you are using an older version, please update it.

tenzen-y avatar Aug 17 '22 07:08 tenzen-y

@johnugeorge @tenzen-y thanks for the explanation. In my case, adding metricsCollectorSpec: to the grid example solved the issue. I am able to click the Trials/YAML/Details tabs in the Experiment view in the Web UI, and though I am still waiting for the trials to finish, I am positive they will succeed eventually. I have also increased the number of epochs to allow the script to reach an optimal. Is this a valid solution? If so, I can gladly update some of the examples in kubeflow/katib that are missing the metrics spec.

DnPlas avatar Aug 30 '22 03:08 DnPlas

@DnPlas That is bit strange. If metric collector is not specified, it is defaulted to StdOutCollector. https://github.com/kubeflow/katib/blob/3906f8e8e00b5923be05f5b9a07442baa0ada2fa/pkg/apis/controller/experiments/v1beta1/experiment_defaults.go#L131

Are your webhooks working correctly? https://github.com/kubeflow/katib/blob/3906f8e8e00b5923be05f5b9a07442baa0ada2fa/pkg/webhook/v1beta1/webhook.go#L45

johnugeorge avatar Aug 30 '22 09:08 johnugeorge

@johnugeorge My webhook is in the cluster, though I am not sure if there is a misconfiguration.

DnPlas avatar Aug 30 '22 14:08 DnPlas

Once you apply grid example, can you verify if metricsCollectorSpec is populated automatically with `kubectl get experiment ? If not, you need to verify why webhooks are not working in your k8s cluster. Are webhooks working for non-katib components?

johnugeorge avatar Aug 30 '22 17:08 johnugeorge

@johnugeorge my trials' metricsCollector is empty, I will check the webhook and try to see what might gone wrong.

spec:
  metricsCollector: {}

DnPlas avatar Aug 30 '22 19:08 DnPlas

I have the same situation as @dnplas, and the Experiment's metricsCollectorSpec is absent (no key), so I suspect the webhook handling the experiment is misbehaving but I'm not sure how

ca-scribner avatar Sep 02 '22 18:09 ca-scribner

While debugging this for my case, I noticed the katib mutatingwebhooks are set to failurePolicy: Ignore. Setting that to Fail gave me some better hints to go on:

kubectl create -f custom-metrics-collector.yaml                          [4:47:19] 
Error from server (InternalError): error when creating "custom-metrics-collector.yaml": Internal error occurred: failed calling webhook "defaulter.experiment.katib.kubeflow.org": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block

So it appears I have some other configuration problems around my certs to dig into

ca-scribner avatar Sep 02 '22 20:09 ca-scribner

I was able to fix my issue, though I discovered mine was different than @ccastro-pedro, as I was not getting the sidecar metrics container injected in the trials pods due to some missing webhook certs.

In order to further debug (in case anyone runs into similar issues), I suggest to:

  • Check the above comments from @johnugeorge and @tenzen-y
  • Check the validation and webhook certificates
    • Change the failurePolicy to Fail, it helps to catch errors in the early stages and prevents things to fail silently
  • Verify that the metrics container can communicate with the katib-db

@charlescurt your issue looks similar to what I was experiencing. In my case, I had to ensure the caBundle was correctly populated in the validation and mutating webhooks. Check the certificates at /tmp/cert in the webhook server and make sure you can make POST requests to it using the certs in your webhooks.

DnPlas avatar Sep 06 '22 18:09 DnPlas

Dear @DnPlas :

In order to further debug (in case anyone runs into similar issues), I suggest to:

  • Check the above comments from @johnugeorge and @tenzen-y

  • Check the validation and webhook certificates

    • Change the failurePolicy to Fail, it helps to catch errors in the early stages and prevents things to fail silently

Could you please indicate what do you refer exactly with that? In which pod should I change the failurePolicy to fail? I have not found the pod in which I can do it.

  • Verify that the metrics container can communicate with the katib-db

Could you specify a bit more how I can verify that? Thank you in advance.

@charlescurt your issue looks similar to what I was experiencing. In my case, I had to ensure the caBundle was correctly populated in the validation and mutating webhooks. Check the certificates at /tmp/cert in the webhook server and make sure you can make POST requests to it using the certs in your webhooks.

joaquingarciaatos avatar Sep 12 '22 14:09 joaquingarciaatos

@joaquingarciaatos After facing this same issue, I eventually realized the problem was that while the webhook service listens on port 443, the webhook pod listens on port 8443 and network traffic to 8443 was blocked in our infrastructure setup. Opening port 8443 on our system nodes solved the problem.

sjmccorm1993 avatar Sep 13 '22 14:09 sjmccorm1993

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 12 '23 00:09 github-actions[bot]

In the latest Katib version, failurePolicy: Fail by default: https://github.com/kubeflow/katib/blob/master/CHANGELOG.md#v0160-rc0-2023-08-05 So the Webhook errors can be caught easily. Feel free to open a new issue, if you see any other problems.

andreyvelich avatar Sep 12 '23 12:09 andreyvelich