bundle-kubeflow

katib-db-manager: hook failed: "update-status"

Open ACodingfreak opened this issue 1 year ago • 6 comments

Bug Description

As shown in the juju status output below, the katib-db-manager unit is stuck with hook failed: "update-status".

mm323:~$ juju status
Model     Controller  Cloud/Region      Version  SLA          Timestamp
kubeflow  uk8sx       my-k8s/localhost  2.9.49   unsupported  16:36:42-07:00

App                        Version                  Status   Scale  Charm                    Channel              Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b    active       1  admission-webhook        1.7/stable           224  10.152.183.247  no
argo-controller            res:oci-image@3902c16    active       1  argo-controller          3.3/stable           376                  no
argo-server                res:oci-image@e2292c9    active       1  argo-server              3.3/stable           309                  no
dex-auth                                            active       1  dex-auth                 2.31/stable          389  10.152.183.43   no
istio-ingressgateway                                active       1  istio-gateway            1.16/stable         1005  10.152.183.29   no
istio-pilot                                         active       1  istio-pilot              1.16/stable          662  10.152.183.128  no
jupyter-controller         res:oci-image@1167186    active       1  jupyter-controller       1.7/stable           805                  no
jupyter-ui                                          active       1  jupyter-ui               1.7/stable           781  10.152.183.198  no
katib-controller           res:oci-image@111495a    active       1  katib-controller         0.15/stable          282  10.152.183.65   no
katib-db                   8.0.36-0ubuntu0.22.04.1  active       1  mysql-k8s                8.0/stable           153  10.152.183.12   no
katib-db-manager                                    waiting      1  katib-db-manager         0.15/stable          253  10.152.183.151  no       installing agent
katib-ui                                            active       1  katib-ui                 0.15/stable          267  10.152.183.13   no
kfp-api                                             active       1  kfp-api                  2.0-alpha.7/stable   935  10.152.183.50   no
kfp-db                     8.0.36-0ubuntu0.22.04.1  active       1  mysql-k8s                8.0/stable           153  10.152.183.84   no
kfp-persistence            res:oci-image@ebed770    active       1  kfp-persistence          2.0-alpha.7/stable   939                  no
kfp-profile-controller     res:oci-image@aa75b0c    active       1  kfp-profile-controller   2.0-alpha.7/stable   899  10.152.183.56   no
kfp-schedwf                res:oci-image@2cb9087    active       1  kfp-schedwf              2.0-alpha.7/stable   952                  no
kfp-ui                     res:oci-image@ae72602    active       1  kfp-ui                   2.0-alpha.7/stable   934  10.152.183.217  no
kfp-viewer                 res:oci-image@899e25f    active       1  kfp-viewer               2.0-alpha.7/stable   964                  no
kfp-viz                    res:oci-image@ffaf37e    active       1  kfp-viz                  2.0-alpha.7/stable   889  10.152.183.70   no
knative-eventing                                    active       1  knative-eventing         1.8/stable           345  10.152.183.174  no
knative-operator                                    active       1  knative-operator         1.8/stable           320  10.152.183.208  no
knative-serving                                     active       1  knative-serving          1.8/stable           346  10.152.183.73   no
kserve-controller                                   active       1  kserve-controller        0.10/stable          458  10.152.183.177  no
kubeflow-dashboard                                  active       1  kubeflow-dashboard       1.7/stable           439  10.152.183.206  no
kubeflow-profiles                                   active       1  kubeflow-profiles        1.7/stable           336  10.152.183.112  no
kubeflow-roles                                      active       1  kubeflow-roles           1.7/stable           148  10.152.183.3    no
kubeflow-volumes           res:oci-image@d261609    active       1  kubeflow-volumes         1.7/stable           204  10.152.183.41   no
metacontroller-operator                             active       1  metacontroller-operator  2.0/stable           204  10.152.183.120  no
minio                      res:oci-image@1755999    active       1  minio                    ckf-1.7/stable       214  10.152.183.121  no
oidc-gatekeeper            res:oci-image@7aae6d7    active       1  oidc-gatekeeper          ckf-1.7/stable       320  10.152.183.75   no
seldon-controller-manager                           active       1  seldon-core              1.15/stable          548  10.152.183.59   no
tensorboard-controller     res:oci-image@c52f7c2    active       1  tensorboard-controller   1.7/stable           156  10.152.183.71   no
tensorboards-web-app       res:oci-image@929f55b    active       1  tensorboards-web-app     1.7/stable           158  10.152.183.115  no
training-operator                                   active       1  training-operator        1.6/stable           305  10.152.183.76   no

Unit                          Workload  Agent  Address       Ports              Message
admission-webhook/0*          active    idle   10.1.121.226  4443/TCP
argo-controller/0*            active    idle   10.1.69.129
argo-server/0*                active    idle   10.1.121.229  2746/TCP
dex-auth/0*                   active    idle   10.1.121.204
istio-ingressgateway/0*       active    idle   10.1.121.205
istio-pilot/0*                active    idle   10.1.69.161
jupyter-controller/0*         active    idle   10.1.121.231
jupyter-ui/0*                 active    idle   10.1.69.164
katib-controller/0*           active    idle   10.1.121.230  443/TCP,8080/TCP
katib-db-manager/0*           error     idle   10.1.121.206                     hook failed: "update-status"
katib-db/0*                   active    idle   10.1.69.167                      Primary
katib-ui/0*                   active    idle   10.1.121.208
kfp-api/0*                    active    idle   10.1.121.209
kfp-db/0*                     active    idle   10.1.121.211                     Primary
kfp-persistence/0*            active    idle   10.1.69.133
kfp-profile-controller/0*     active    idle   10.1.69.130   80/TCP
kfp-schedwf/0*                active    idle   10.1.121.232
kfp-ui/0*                     active    idle   10.1.69.136   3000/TCP
kfp-viewer/0*                 active    idle   10.1.69.179
kfp-viz/0*                    active    idle   10.1.69.131   8888/TCP
knative-eventing/0*           active    idle   10.1.69.168
knative-operator/0*           active    idle   10.1.69.171
knative-serving/0*            active    idle   10.1.69.170
kserve-controller/0*          active    idle   10.1.69.173
kubeflow-dashboard/0*         active    idle   10.1.69.172
kubeflow-profiles/0*          active    idle   10.1.69.175
kubeflow-roles/0*             active    idle   10.1.69.174
kubeflow-volumes/0*           active    idle   10.1.121.217  5000/TCP
metacontroller-operator/0*    active    idle   10.1.121.212
minio/0*                      active    idle   10.1.121.221  9000/TCP,9001/TCP
oidc-gatekeeper/0*            active    idle   10.1.69.141   8080/TCP
seldon-controller-manager/0*  active    idle   10.1.69.177
tensorboard-controller/0*     active    idle   10.1.69.135   9443/TCP
tensorboards-web-app/0*       active    idle   10.1.69.182   5000/TCP
training-operator/0*          active    idle   10.1.121.215
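
With this many units, the failing one is easy to miss. A minimal sketch of a filter, assuming the default tabular `juju status` layout shown above (the function name `non_active_units` is hypothetical, not a Juju command):

```shell
# Hypothetical helper: print units whose workload state is not "active".
# Reads `juju status` text on stdin; assumes the default tabular layout,
# where the unit table starts with a "Unit" header and ends at a blank line.
non_active_units() {
  awk '
    $1 == "Unit" { in_units = 1; next }    # header row of the unit table
    in_units && NF == 0 { in_units = 0 }   # blank line ends the table
    in_units && $2 != "active" {
      unit = $1
      sub(/\*$/, "", unit)                 # drop the leader marker "*"
      print unit, $2
    }
  '
}

# Usage: juju status | non_active_units
```

Piped through this filter, the status above should list only the katib-db-manager unit and its `error` workload state.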


To Reproduce

sudo snap install microk8s --channel=1.24/stable --classic
sudo snap install juju --classic --channel=2.9/stable
microk8s config | juju add-k8s my-k8s --client
juju bootstrap my-k8s uk8sx
juju add-model kubeflow
juju deploy kubeflow --trust  --channel=1.7/stable

Environment

Ubuntu:22.04
microk8s:1.24
juju:2.9
kubeflow:1.7

Relevant Log Output

The logs collected with the commands below are attached:

$ microk8s.kubectl logs -n kubeflow katib-db-manager-0 > katib-db-manager-0
Defaulted container "charm" out of: charm, katib-db-manager, charm-init (init)

$ microk8s.kubectl logs -n kubeflow katib-db-0 > katib-db-0
Defaulted container "charm" out of: charm, mysql, charm-init (init)

logs_2.zip

Additional Context

No response

ACodingfreak • Jul 02 '24 00:07

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5953.

This message was autogenerated

When I deleted the katib-db-manager pod, microk8s automatically recreated it, and it then came up properly:

kubectl delete pod katib-db-manager-0 -n kubeflow
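
The same workaround can be scripted with a readiness check. This is only a sketch: it assumes the pod is recreated under the same name (as happened here), that `kubectl` points at the microk8s cluster, and the function name `restart_pod` is hypothetical:

```shell
# Hypothetical helper: delete a pod and wait for its replacement to be Ready.
# Usage: restart_pod <pod-name> [namespace]
restart_pod() {
  pod="$1"
  ns="${2:-kubeflow}"
  kubectl delete pod "$pod" -n "$ns" || return 1
  # Poll the Ready condition of the recreated pod, up to about 5 minutes.
  i=0
  while [ "$i" -lt 60 ]; do
    ready=$(kubectl get pod "$pod" -n "$ns" \
      -o 'jsonpath={.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
    [ "$ready" = "True" ] && return 0
    sleep 5
    i=$((i + 1))
  done
  return 1
}

# Usage: restart_pod katib-db-manager-0 kubeflow
```

A non-zero exit status means the pod did not become Ready within the polling window; `juju status` would still need to be checked for the unit's hook state.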

ACodingfreak • Jul 02 '24 03:07

But now I am having an issue with the Kubeflow UI.

When I log in at http://10.10.26.236:31456/, I land on the page below:

image

I clicked "start setup":

image

When I click the finish button, nothing happens and I am not redirected to the next page. However, kubectl shows that the profile was created:

$ kubectl get profiles
NAME    AGE
admin   20m

ACodingfreak • Jul 02 '24 04:07

In the meantime, katib-db went down:

unit-katib-ui-0: 21:38:51 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-katib-db-0: 21:39:08 ERROR unit.katib-db/0.juju-log Failed to flush [<MySQLTextLogs.ERROR: 'ERROR LOGS'>, <MySQLTextLogs.GENERAL: 'GENERAL LOGS'>, <MySQLTextLogs.SLOW: 'SLOW LOGS'>] logs.
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 602, in _run_mysqlsh_script
    stdout, _ = process.wait_output()
  File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1635, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr='Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2024-07-02T04:39:06Z: Loading startup files...\nverbose: 2024-07-02T04:39:06Z: Loading plugins...\nverbose: 2024-07-02T04:39:06Z: Connecting to MySQL at: serverconfig@katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local\nTraceback (most recent call last):\n  File "<string>", line 1, in <module>\nmysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at \'reading initial communication packet\', system error: 104\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/mysql/v0/mysql.py", line 3139, in flush_mysql_logs
    self._run_mysqlsh_script("\n".join(flush_logs_commands), timeout=50)
  File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 544, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 605, in _run_mysqlsh_script
    raise MySQLClientError(e.stderr)
charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
verbose: 2024-07-02T04:39:06Z: Loading startup files...
verbose: 2024-07-02T04:39:06Z: Loading plugins...
verbose: 2024-07-02T04:39:06Z: Connecting to MySQL at: serverconfig@katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local
Traceback (most recent call last):
  File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at 'reading initial communication packet', system error: 104

unit-katib-db-0: 21:39:14 INFO unit.katib-db/0.juju-log Setting up the logrotate configurations
unit-katib-db-0: 21:39:14 INFO unit.katib-db/0.juju-log Adding pebble layer
unit-katib-db-0: 21:39:22 INFO unit.katib-db/0.juju-log Unit workload member-state is offline with member-role unknown
unit-katib-db-0: 21:39:22 INFO unit.katib-db/0.juju-log Attempting reboot from complete outage.
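
The key detail in this traceback is the pair of codes at the end: MySQL client error 2013 (lost connection to the server) and system error 104, which on Linux is ECONNRESET (connection reset by peer). A small sketch for pulling those codes out of a long log, assuming the message format shown above (the function name `mysql_error_codes` is hypothetical):

```shell
# Hypothetical helper: extract MySQL error and system errno pairs from
# log text on stdin, one unique pair per line.
mysql_error_codes() {
  sed -n 's/.*MySQL Error (\([0-9]*\)).*system error: \([0-9]*\).*/mysql_error=\1 system_errno=\2/p' |
    sort -u
}

# Usage: juju debug-log --replay | mysql_error_codes
```

This only summarizes what the log already says; it does not explain why the server reset the connection.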

$ juju status
Model     Controller  Cloud/Region      Version  SLA          Timestamp
kubeflow  uk8sx       my-k8s/localhost  2.9.49   unsupported  21:42:42-07:00

App                        Version                  Status   Scale  Charm                    Channel              Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b    active       1  admission-webhook        1.7/stable           224  10.152.183.247  no
argo-controller            res:oci-image@3902c16    active       1  argo-controller          3.3/stable           376                  no
argo-server                res:oci-image@e2292c9    active       1  argo-server              3.3/stable           309                  no
dex-auth                                            active       1  dex-auth                 2.31/stable          389  10.152.183.43   no
istio-ingressgateway                                active       1  istio-gateway            1.16/stable         1005  10.152.183.29   no
istio-pilot                                         active       1  istio-pilot              1.16/stable          662  10.152.183.128  no
jupyter-controller         res:oci-image@1167186    active       1  jupyter-controller       1.7/stable           805                  no
jupyter-ui                                          active       1  jupyter-ui               1.7/stable           781  10.152.183.198  no
katib-controller           res:oci-image@111495a    active       1  katib-controller         0.15/stable          282  10.152.183.65   no
katib-db                   8.0.36-0ubuntu0.22.04.1  waiting      1  mysql-k8s                8.0/stable           153  10.152.183.12   no       installing agent
katib-db-manager                                    active       1  katib-db-manager         0.15/stable          253  10.152.183.151  no
katib-ui                                            active       1  katib-ui                 0.15/stable          267  10.152.183.13   no
kfp-api                                             active       1  kfp-api                  2.0-alpha.7/stable   935  10.152.183.50   no
kfp-db                     8.0.36-0ubuntu0.22.04.1  active       1  mysql-k8s                8.0/stable           153  10.152.183.84   no
kfp-persistence            res:oci-image@ebed770    active       1  kfp-persistence          2.0-alpha.7/stable   939                  no
kfp-profile-controller     res:oci-image@aa75b0c    active       1  kfp-profile-controller   2.0-alpha.7/stable   899  10.152.183.56   no
kfp-schedwf                res:oci-image@2cb9087    active       1  kfp-schedwf              2.0-alpha.7/stable   952                  no
kfp-ui                     res:oci-image@ae72602    active       1  kfp-ui                   2.0-alpha.7/stable   934  10.152.183.217  no
kfp-viewer                 res:oci-image@899e25f    active       1  kfp-viewer               2.0-alpha.7/stable   964                  no
kfp-viz                    res:oci-image@ffaf37e    active       1  kfp-viz                  2.0-alpha.7/stable   889  10.152.183.70   no
knative-eventing                                    active       1  knative-eventing         1.8/stable           345  10.152.183.174  no
knative-operator                                    active       1  knative-operator         1.8/stable           320  10.152.183.208  no
knative-serving                                     active       1  knative-serving          1.8/stable           346  10.152.183.73   no
kserve-controller                                   active       1  kserve-controller        0.10/stable          458  10.152.183.177  no
kubeflow-dashboard                                  active       1  kubeflow-dashboard       1.7/stable           439  10.152.183.206  no
kubeflow-profiles                                   active       1  kubeflow-profiles        1.7/stable           336  10.152.183.112  no
kubeflow-roles                                      active       1  kubeflow-roles           1.7/stable           148  10.152.183.3    no
kubeflow-volumes           res:oci-image@d261609    active       1  kubeflow-volumes         1.7/stable           204  10.152.183.41   no
metacontroller-operator                             active       1  metacontroller-operator  2.0/stable           204  10.152.183.120  no
minio                      res:oci-image@1755999    active       1  minio                    ckf-1.7/stable       214  10.152.183.121  no
oidc-gatekeeper            res:oci-image@7aae6d7    active       1  oidc-gatekeeper          ckf-1.7/stable       320  10.152.183.75   no
seldon-controller-manager                           active       1  seldon-core              1.15/stable          548  10.152.183.59   no
tensorboard-controller     res:oci-image@c52f7c2    active       1  tensorboard-controller   1.7/stable           156  10.152.183.71   no
tensorboards-web-app       res:oci-image@929f55b    active       1  tensorboards-web-app     1.7/stable           158  10.152.183.115  no
training-operator                                   active       1  training-operator        1.6/stable           305  10.152.183.76   no

Unit                          Workload     Agent  Address       Ports              Message
admission-webhook/0*          active       idle   10.1.121.226  4443/TCP
argo-controller/0*            active       idle   10.1.69.129
argo-server/0*                active       idle   10.1.121.229  2746/TCP
dex-auth/0*                   active       idle   10.1.121.204
istio-ingressgateway/0*       active       idle   10.1.121.205
istio-pilot/0*                active       idle   10.1.69.161
jupyter-controller/0*         active       idle   10.1.121.231
jupyter-ui/0*                 active       idle   10.1.69.164
katib-controller/0*           active       idle   10.1.121.230  443/TCP,8080/TCP
katib-db-manager/0*           active       idle   10.1.69.142
katib-db/0*                   maintenance  idle   10.1.69.167                      offline
katib-ui/0*                   active       idle   10.1.121.208
kfp-api/0*                    active       idle   10.1.121.209
kfp-db/0*                     active       idle   10.1.121.211                     Primary

It then restarted automatically and came up properly.

ACodingfreak • Jul 02 '24 04:07

@shayancanonical maybe you should also take a look at this one

DnPlas • Jul 02 '24 12:07

Regarding the Kubeflow UI issue described above: after a server reboot, juju status showed multiple units in the "agent lost, see 'juju show-status-log'" state. I restarted the pods of the units in that state, which included dex. I then tried logging into the Kubeflow UI again, and this time I got past the initial setup pages.

ACodingfreak • Jul 02 '24 14:07

@ACodingfreak Thank you for reporting this. 1.7 is going out of support in about a week and since we have not seen this issue (referring to katib-db-manager behaviour) in later versions, it won't be prioritized by the team as we are focusing on newer releases. What we could suggest is to upgrade your Kubeflow deployment to 1.8 following these instructions.

orfeas-k • Jul 17 '24 13:07

@orfeas-k - Thanks for the update regarding the 1.7 release.

Before I ended up on the 1.7 release, I was trying the versions below and ran into another katib-db-manager issue: https://github.com/canonical/bundle-kubeflow/issues/961

Microk8s: 1.29/stable
Juju: 3.4/stable
Kubeflow: 1.8/stable

ACodingfreak • Jul 18 '24 17:07

Ok @ACodingfreak, this makes sense. I see that the issue has been resolved though, which means that you should not have an issue with newer versions. If that's not the case, feel free to open another issue.

orfeas-k • Jul 22 '24 07:07

Well, it works properly now with the versions below. Since 1.7 is EOL, I can move to 1.8.

Microk8s: 1.29/stable
Juju: 3.4/stable
Kubeflow: 1.8/stable

ACodingfreak • Jul 22 '24 16:07