katib-db-manager: hook failed: "update-status"
Bug Description
As the juju status output below shows, the katib-db-manager unit is stuck in the error state with the message hook failed: "update-status".
mm323:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow uk8sx my-k8s/localhost 2.9.49 unsupported 16:36:42-07:00
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook res:oci-image@2d74d1b active 1 admission-webhook 1.7/stable 224 10.152.183.247 no
argo-controller res:oci-image@3902c16 active 1 argo-controller 3.3/stable 376 no
argo-server res:oci-image@e2292c9 active 1 argo-server 3.3/stable 309 no
dex-auth active 1 dex-auth 2.31/stable 389 10.152.183.43 no
istio-ingressgateway active 1 istio-gateway 1.16/stable 1005 10.152.183.29 no
istio-pilot active 1 istio-pilot 1.16/stable 662 10.152.183.128 no
jupyter-controller res:oci-image@1167186 active 1 jupyter-controller 1.7/stable 805 no
jupyter-ui active 1 jupyter-ui 1.7/stable 781 10.152.183.198 no
katib-controller res:oci-image@111495a active 1 katib-controller 0.15/stable 282 10.152.183.65 no
katib-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 153 10.152.183.12 no
katib-db-manager waiting 1 katib-db-manager 0.15/stable 253 10.152.183.151 no installing agent
katib-ui active 1 katib-ui 0.15/stable 267 10.152.183.13 no
kfp-api active 1 kfp-api 2.0-alpha.7/stable 935 10.152.183.50 no
kfp-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 153 10.152.183.84 no
kfp-persistence res:oci-image@ebed770 active 1 kfp-persistence 2.0-alpha.7/stable 939 no
kfp-profile-controller res:oci-image@aa75b0c active 1 kfp-profile-controller 2.0-alpha.7/stable 899 10.152.183.56 no
kfp-schedwf res:oci-image@2cb9087 active 1 kfp-schedwf 2.0-alpha.7/stable 952 no
kfp-ui res:oci-image@ae72602 active 1 kfp-ui 2.0-alpha.7/stable 934 10.152.183.217 no
kfp-viewer res:oci-image@899e25f active 1 kfp-viewer 2.0-alpha.7/stable 964 no
kfp-viz res:oci-image@ffaf37e active 1 kfp-viz 2.0-alpha.7/stable 889 10.152.183.70 no
knative-eventing active 1 knative-eventing 1.8/stable 345 10.152.183.174 no
knative-operator active 1 knative-operator 1.8/stable 320 10.152.183.208 no
knative-serving active 1 knative-serving 1.8/stable 346 10.152.183.73 no
kserve-controller active 1 kserve-controller 0.10/stable 458 10.152.183.177 no
kubeflow-dashboard active 1 kubeflow-dashboard 1.7/stable 439 10.152.183.206 no
kubeflow-profiles active 1 kubeflow-profiles 1.7/stable 336 10.152.183.112 no
kubeflow-roles active 1 kubeflow-roles 1.7/stable 148 10.152.183.3 no
kubeflow-volumes res:oci-image@d261609 active 1 kubeflow-volumes 1.7/stable 204 10.152.183.41 no
metacontroller-operator active 1 metacontroller-operator 2.0/stable 204 10.152.183.120 no
minio res:oci-image@1755999 active 1 minio ckf-1.7/stable 214 10.152.183.121 no
oidc-gatekeeper res:oci-image@7aae6d7 active 1 oidc-gatekeeper ckf-1.7/stable 320 10.152.183.75 no
seldon-controller-manager active 1 seldon-core 1.15/stable 548 10.152.183.59 no
tensorboard-controller res:oci-image@c52f7c2 active 1 tensorboard-controller 1.7/stable 156 10.152.183.71 no
tensorboards-web-app res:oci-image@929f55b active 1 tensorboards-web-app 1.7/stable 158 10.152.183.115 no
training-operator active 1 training-operator 1.6/stable 305 10.152.183.76 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.1.121.226 4443/TCP
argo-controller/0* active idle 10.1.69.129
argo-server/0* active idle 10.1.121.229 2746/TCP
dex-auth/0* active idle 10.1.121.204
istio-ingressgateway/0* active idle 10.1.121.205
istio-pilot/0* active idle 10.1.69.161
jupyter-controller/0* active idle 10.1.121.231
jupyter-ui/0* active idle 10.1.69.164
katib-controller/0* active idle 10.1.121.230 443/TCP,8080/TCP
katib-db-manager/0* error idle 10.1.121.206 hook failed: "update-status"
katib-db/0* active idle 10.1.69.167 Primary
katib-ui/0* active idle 10.1.121.208
kfp-api/0* active idle 10.1.121.209
kfp-db/0* active idle 10.1.121.211 Primary
kfp-persistence/0* active idle 10.1.69.133
kfp-profile-controller/0* active idle 10.1.69.130 80/TCP
kfp-schedwf/0* active idle 10.1.121.232
kfp-ui/0* active idle 10.1.69.136 3000/TCP
kfp-viewer/0* active idle 10.1.69.179
kfp-viz/0* active idle 10.1.69.131 8888/TCP
knative-eventing/0* active idle 10.1.69.168
knative-operator/0* active idle 10.1.69.171
knative-serving/0* active idle 10.1.69.170
kserve-controller/0* active idle 10.1.69.173
kubeflow-dashboard/0* active idle 10.1.69.172
kubeflow-profiles/0* active idle 10.1.69.175
kubeflow-roles/0* active idle 10.1.69.174
kubeflow-volumes/0* active idle 10.1.121.217 5000/TCP
metacontroller-operator/0* active idle 10.1.121.212
minio/0* active idle 10.1.121.221 9000/TCP,9001/TCP
oidc-gatekeeper/0* active idle 10.1.69.141 8080/TCP
seldon-controller-manager/0* active idle 10.1.69.177
tensorboard-controller/0* active idle 10.1.69.135 9443/TCP
tensorboards-web-app/0* active idle 10.1.69.182 5000/TCP
training-operator/0* active idle 10.1.121.215
To Reproduce
sudo snap install microk8s --channel=1.24/stable --classic
sudo snap install juju --classic --channel=2.9/stable
microk8s config | juju add-k8s my-k8s --client
juju bootstrap my-k8s uk8sx
juju add-model kubeflow
juju deploy kubeflow --trust --channel=1.7/stable
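With this many applications in the bundle, the failing unit is easy to miss in the status table. The sketch below filters tabular `juju status` text for units whose workload column reads `error`. `failed_units` is a hypothetical helper name, and the parsing assumes the column layout shown in this report: a unit table that starts at the `Unit` header row and ends at a blank line.

```shell
# Hypothetical helper: list units whose workload status is "error".
# Reads tabular `juju status` text on stdin, so it can be tried on saved output.
failed_units() {
  awk '
    $1 == "Unit" { in_units = 1; next }   # header row of the unit table
    NF == 0      { in_units = 0 }         # a blank line ends the table
    in_units && $2 == "error" {
      sub(/\*$/, "", $1)                  # drop the leader marker "*"
      print $1
    }
  '
}

# Usage against a live model (assumes the model is named "kubeflow"):
#   juju status -m kubeflow | failed_units
```

On the status output in this report, the filter would be expected to pick out katib-db-manager/0, the only unit in the error state.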
Environment
Ubuntu:22.04
microk8s:1.24
juju:2.9
kubeflow:1.7
Relevant Log Output
Logs are attached; they were collected as follows:
$ microk8s.kubectl logs -n kubeflow katib-db-manager-0 > katib-db-manager-0
Defaulted container "charm" out of: charm, katib-db-manager, charm-init (init)
$ microk8s.kubectl logs -n kubeflow katib-db-0 > katib-db-0
Defaulted container "charm" out of: charm, mysql, charm-init (init)
Additional Context
No response
Thank you for reporting us your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5953.
This message was autogenerated
When I deleted the katib-db-manager pod, microk8s automatically recreated it and the unit then came up properly:
kubectl delete pod katib-db-manager-0 -n kubeflow
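The workaround above can be generalised a little. The sketch below derives the pod name from a Juju unit name and prints the matching delete command instead of running it. It assumes the unit-to-pod naming seen in this thread (unit katib-db-manager/0 runs in pod katib-db-manager-0) and defaults the namespace to kubeflow, the model name used here. `unit_to_pod` and `delete_pod_cmd` are hypothetical helper names.

```shell
# Map a Juju unit name (app/N) to its pod name (app-N).
unit_to_pod() {
  printf '%s\n' "${1%/*}-${1##*/}"
}

# Print the kubectl command that deletes the pod behind a unit.
# $1: unit name, $2: namespace (assumed default: kubeflow).
# The command is only printed; pipe the output to `sh` to run it.
delete_pod_cmd() {
  printf 'kubectl delete pod %s -n %s\n' "$(unit_to_pod "$1")" "${2:-kubeflow}"
}

# Usage:
#   delete_pod_cmd katib-db-manager/0
```

Printing the command first leaves room to check the pod name before anything is deleted.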
But now I am having an issue with the Kubeflow UI.
When I log in at http://10.10.26.236:31456/ I land on the page below.
I clicked "Start setup".
When I then click the Finish button, nothing happens and I am not redirected to the next page, although kubectl shows that the profile has been created.
$ kubectl get profiles
NAME AGE
admin 20m
In the meantime, katib-db went down:
unit-katib-ui-0: 21:38:51 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-katib-db-0: 21:39:08 ERROR unit.katib-db/0.juju-log Failed to flush [<MySQLTextLogs.ERROR: 'ERROR LOGS'>, <MySQLTextLogs.GENERAL: 'GENERAL LOGS'>, <MySQLTextLogs.SLOW: 'SLOW LOGS'>] logs.
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 602, in _run_mysqlsh_script
stdout, _ = process.wait_output()
File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1635, in wait_output
raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr='Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2024-07-02T04:39:06Z: Loading startup files...\nverbose: 2024-07-02T04:39:06Z: Loading plugins...\nverbose: 2024-07-02T04:39:06Z: Connecting to MySQL at: serverconfig@katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local\nTraceback (most recent call last):\n File "<string>", line 1, in <module>\nmysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at \'reading initial communication packet\', system error: 104\n'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/mysql/v0/mysql.py", line 3139, in flush_mysql_logs
self._run_mysqlsh_script("\n".join(flush_logs_commands), timeout=50)
File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 544, in wrapped_function
return callable(*args, **kwargs) # type: ignore
File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 605, in _run_mysqlsh_script
raise MySQLClientError(e.stderr)
charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
verbose: 2024-07-02T04:39:06Z: Loading startup files...
verbose: 2024-07-02T04:39:06Z: Loading plugins...
verbose: 2024-07-02T04:39:06Z: Connecting to MySQL at: serverconfig@katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local
Traceback (most recent call last):
File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at 'reading initial communication packet', system error: 104
unit-katib-db-0: 21:39:14 INFO unit.katib-db/0.juju-log Setting up the logrotate configurations
unit-katib-db-0: 21:39:14 INFO unit.katib-db/0.juju-log Adding pebble layer
unit-katib-db-0: 21:39:22 INFO unit.katib-db/0.juju-log Unit workload member-state is offline with member-role unknown
unit-katib-db-0: 21:39:22 INFO unit.katib-db/0.juju-log Attempting reboot from complete outage.
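In the excerpt above, the useful line is the single ERROR entry for katib-db-0; system error 104 in the MySQL message is ECONNRESET on Linux, meaning the connection was reset by the peer. To pull such entries out of a longer log, a filter along these lines may help. It is a sketch that assumes the `unit-<name>: HH:MM:SS LEVEL ...` line format shown in this thread, and `unit_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: keep only the ERROR lines logged by one unit agent.
# $1 is the agent name as it appears after "unit-", e.g. katib-db-0.
# Continuation lines (tracebacks) are not captured, only the first line
# of each entry.
unit_errors() {
  grep -E "^unit-$1: [0-9]{2}:[0-9]{2}:[0-9]{2} ERROR "
}

# Usage against a live model (assumes the model is named "kubeflow"):
#   juju debug-log -m kubeflow --replay | unit_errors katib-db-0
```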
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow uk8sx my-k8s/localhost 2.9.49 unsupported 21:42:42-07:00
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook res:oci-image@2d74d1b active 1 admission-webhook 1.7/stable 224 10.152.183.247 no
argo-controller res:oci-image@3902c16 active 1 argo-controller 3.3/stable 376 no
argo-server res:oci-image@e2292c9 active 1 argo-server 3.3/stable 309 no
dex-auth active 1 dex-auth 2.31/stable 389 10.152.183.43 no
istio-ingressgateway active 1 istio-gateway 1.16/stable 1005 10.152.183.29 no
istio-pilot active 1 istio-pilot 1.16/stable 662 10.152.183.128 no
jupyter-controller res:oci-image@1167186 active 1 jupyter-controller 1.7/stable 805 no
jupyter-ui active 1 jupyter-ui 1.7/stable 781 10.152.183.198 no
katib-controller res:oci-image@111495a active 1 katib-controller 0.15/stable 282 10.152.183.65 no
katib-db 8.0.36-0ubuntu0.22.04.1 waiting 1 mysql-k8s 8.0/stable 153 10.152.183.12 no installing agent
katib-db-manager active 1 katib-db-manager 0.15/stable 253 10.152.183.151 no
katib-ui active 1 katib-ui 0.15/stable 267 10.152.183.13 no
kfp-api active 1 kfp-api 2.0-alpha.7/stable 935 10.152.183.50 no
kfp-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 153 10.152.183.84 no
kfp-persistence res:oci-image@ebed770 active 1 kfp-persistence 2.0-alpha.7/stable 939 no
kfp-profile-controller res:oci-image@aa75b0c active 1 kfp-profile-controller 2.0-alpha.7/stable 899 10.152.183.56 no
kfp-schedwf res:oci-image@2cb9087 active 1 kfp-schedwf 2.0-alpha.7/stable 952 no
kfp-ui res:oci-image@ae72602 active 1 kfp-ui 2.0-alpha.7/stable 934 10.152.183.217 no
kfp-viewer res:oci-image@899e25f active 1 kfp-viewer 2.0-alpha.7/stable 964 no
kfp-viz res:oci-image@ffaf37e active 1 kfp-viz 2.0-alpha.7/stable 889 10.152.183.70 no
knative-eventing active 1 knative-eventing 1.8/stable 345 10.152.183.174 no
knative-operator active 1 knative-operator 1.8/stable 320 10.152.183.208 no
knative-serving active 1 knative-serving 1.8/stable 346 10.152.183.73 no
kserve-controller active 1 kserve-controller 0.10/stable 458 10.152.183.177 no
kubeflow-dashboard active 1 kubeflow-dashboard 1.7/stable 439 10.152.183.206 no
kubeflow-profiles active 1 kubeflow-profiles 1.7/stable 336 10.152.183.112 no
kubeflow-roles active 1 kubeflow-roles 1.7/stable 148 10.152.183.3 no
kubeflow-volumes res:oci-image@d261609 active 1 kubeflow-volumes 1.7/stable 204 10.152.183.41 no
metacontroller-operator active 1 metacontroller-operator 2.0/stable 204 10.152.183.120 no
minio res:oci-image@1755999 active 1 minio ckf-1.7/stable 214 10.152.183.121 no
oidc-gatekeeper res:oci-image@7aae6d7 active 1 oidc-gatekeeper ckf-1.7/stable 320 10.152.183.75 no
seldon-controller-manager active 1 seldon-core 1.15/stable 548 10.152.183.59 no
tensorboard-controller res:oci-image@c52f7c2 active 1 tensorboard-controller 1.7/stable 156 10.152.183.71 no
tensorboards-web-app res:oci-image@929f55b active 1 tensorboards-web-app 1.7/stable 158 10.152.183.115 no
training-operator active 1 training-operator 1.6/stable 305 10.152.183.76 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.1.121.226 4443/TCP
argo-controller/0* active idle 10.1.69.129
argo-server/0* active idle 10.1.121.229 2746/TCP
dex-auth/0* active idle 10.1.121.204
istio-ingressgateway/0* active idle 10.1.121.205
istio-pilot/0* active idle 10.1.69.161
jupyter-controller/0* active idle 10.1.121.231
jupyter-ui/0* active idle 10.1.69.164
katib-controller/0* active idle 10.1.121.230 443/TCP,8080/TCP
katib-db-manager/0* active idle 10.1.69.142
katib-db/0* maintenance idle 10.1.69.167 offline
katib-ui/0* active idle 10.1.121.208
kfp-api/0* active idle 10.1.121.209
kfp-db/0* active idle 10.1.121.211 Primary
katib-db then restarted automatically and came up properly.
@shayancanonical maybe you should also take a look at this one
After a server reboot, juju status showed multiple units in the "agent lost, see 'juju show-status-log" state. I restarted the pods of the units in that state, including dex. I then tried logging in to the Kubeflow UI again, and this time I got past the initial setup screens.
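For anyone hitting the same state after a reboot, the affected pods can be listed from the status output rather than picked out by hand. The sketch below matches unit rows whose message contains "agent lost" and prints the corresponding pod names. It assumes the tabular layout shown earlier in this thread and the app/N to app-N pod naming; `lost_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: print pod names for units reporting "agent lost".
# Reads tabular `juju status` text on stdin.
lost_pods() {
  awk '
    $1 == "Unit" { in_units = 1; next }   # header row of the unit table
    NF == 0      { in_units = 0 }         # a blank line ends the table
    in_units && /agent lost/ {
      sub(/\*$/, "", $1)                  # drop the leader marker "*"
      sub(/\//, "-", $1)                  # unit app/N -> pod app-N
      print $1
    }
  '
}

# Usage (review the list before deleting anything):
#   juju status -m kubeflow | lost_pods
#   juju status -m kubeflow | lost_pods | xargs -r -n1 kubectl delete pod -n kubeflow
```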
@ACodingfreak Thank you for reporting this. 1.7 is going out of support in about a week and since we have not seen this issue (referring to katib-db-manager behaviour) in later versions, it won't be prioritized by the team as we are focusing on newer releases. What we could suggest is to upgrade your Kubeflow deployment to 1.8 following these instructions.
@orfeas-k - Thanks for the update regarding 1.7 release
Before I ended up on the 1.7 release, I tried the versions below and ran into another katib-db-manager issue: https://github.com/canonical/bundle-kubeflow/issues/961
Microk8s: 1.29/stable
Juju: 3.4/stable
Kubeflow: 1.8/stable
Ok @ACodingfreak, this makes sense. I see that the issue has been resolved though, which means that you should not have an issue with newer versions. If that's not the case, feel free to open another issue.
Well, it works properly now with the versions below. Since 1.7 is EOL, I can move to 1.8.
Microk8s: 1.29/stable
Juju: 3.4/stable
Kubeflow: 1.8/stable

