KF 1.5 metadata-grpc-deployment "Failed to connect to the database: mysql_real_connect failed"
/kind bug
What steps did you take and what happened:
I can't see the Artifacts page in the KF 1.5 web app on my KF 1.5.1 manifests installation.
What did you expect to happen: I should be able to see the Artifacts without errors.
Anything else you would like to add:
- With `kubectl -n kubeflow describe deployment metadata-grpc-deployment`, the container image is `gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0`.
- The log of the `metadata-grpc-deployment` pod (found via `kubectl -n kubeflow describe pod -l component=metadata-grpc-server`) shows errors:

```
I0718 11:21:46.510772 1 metadata_store_server_main.cc:258] Server listening on 0.0.0.0:8080
W0718 12:30:54.066205 173 metadata_store_service_impl.cc:432] Failed to connect to the database: mysql_real_connect failed: errno: , error:
W0718 12:30:54.066272 171 metadata_store_service_impl.cc:432] Failed to connect to the database: mysql_real_connect failed: errno: , error:
```
- I was able to log in to the mysql pod with `kubectl -n kubeflow exec -it mysql-b746975b5-lt2n9 -- /bin/bash`, and I was also able to log in to the `metadb` database with the default user `root` and the password from the secret `mysql-secret`:

```
mysql -D metadb -u root -p ""
mysql> use metadb
mysql> show tables;
mysql> quit
```

The mysql service also seems to be fine.
- I also tried to re-deploy `mysql` and `metadata-grpc-deployment`, but that doesn't help either:

```
kubectl rollout restart deployment mysql -n kubeflow
kubectl rollout restart deployment metadata-grpc-deployment -n kubeflow
```
My other on-prem KF 1.4.0 manifests installation with `gcr.io/tfx-oss-public/ml_metadata_store_server:1.0.0` doesn't seem to have this issue, so it might be related to the image `gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0` (the connection flags the server is started with can be double-checked as sketched below).
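A minimal sketch for double-checking those flags, assuming the stock deployment name; in the upstream manifests the container args include flags like `--mysql_config_host` and `--mysql_config_port`, but the exact names may vary between ml-metadata versions:

```sh
# Print the args the metadata gRPC server container is started with,
# e.g. the MySQL host/port/database flags it will try to connect to.
kubectl -n kubeflow get deployment metadata-grpc-deployment \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```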
Environment:
- Kubeflow version: KF manifests 1.5.1
- kfctl version (`kfctl version`): none
- Kubernetes platform (e.g. `minikube`): microk8s, 3-node cluster
- Kubernetes version (`kubectl version`):

```
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-14T02:31:37Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-3+6937f71915b56b", GitCommit:"6937f71915b56b6004162b7c7b3f11f196100d44", GitTreeState:"clean", BuildDate:"2022-04-28T11:11:24Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
```

- OS (from `/etc/os-release`):

```
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
```
I have the same issue with Kubeflow 1.6.0. There is no issue connecting to mysql from a shell in the `metadata-grpc-deployment` container, but the connection attempts from `metadata_store_service_impl.cc` are failing.
Hi, I am also seeing the same issue on KF 1.5.0, though still with `gcr.io/tfx-oss-public/ml_metadata_store_server:1.0.0`. So perhaps it has to do with some other part of the manifests between 1.4 -> 1.5.
Same for me, seeing this in v1.6.
Still experiencing this after upgrading to the KF v1.6.1 manifests.
Strangely, the URL _/pipeline/?ns=kubeflow-1#/artifacts shows the error, but when I am forwarded from the Executions tab to the URL /_/pipeline/?ns=kubeflow-1#/artifacts/1, it works.
I can access the Overview and the Lineage Explorer for a particular artifact. Maybe the query for the main Artifacts page is broken for multi-tenancy?
+1
Also seeing this here with 1.6.1. An interesting thing, though: if I do a deployment from scratch, the error doesn't seem to trigger until some amount of time has passed. Has anyone else experienced this?
I am experiencing this.
+1
+1
Hey, so a little update on what I found. It seems like this might actually be caused by Istio, at least for my deployment on Azure AKS (it might have something to do with the kubeflow namespace being labeled a "control plane"). What I did to at least temporarily handle the problem, while a better solution was being worked up, was to set the DestinationRules' TLS mode from `ISTIO_MUTUAL` to `DISABLE` (see the sketch below). I wouldn't necessarily recommend making these changes in all cases, but in our case we decided it was fine.
For further detail, it just seems like the Istio certificate issuer can't update the certificates that are being pushed out for the `ISTIO_MUTUAL` TLS mode, for whatever reason. That's why the services seem to stop working after some time. Hope this helps, and if I ever have time to come up with a better solution than turning TLS off, I'll try to remember to come back to this.
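A minimal sketch of that workaround, assuming the relevant DestinationRule is the one for the metadata gRPC service and is named `metadata-grpc-service` (check the actual names in your cluster first; note this disables mTLS for that traffic):

```sh
# List the DestinationRules installed by the manifests.
kubectl -n kubeflow get destinationrules

# Hypothetical example: flip one rule's TLS mode from ISTIO_MUTUAL to DISABLE.
# Repeat for each affected DestinationRule.
kubectl -n kubeflow patch destinationrule metadata-grpc-service \
  --type merge \
  -p '{"spec":{"trafficPolicy":{"tls":{"mode":"DISABLE"}}}}'
```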
I've resolved this issue in a different way: I increased the `max_connections` parameter for MySQL from `151` to `300`, because the Kubeflow Artifacts page tries to load all artifacts from all users; I had 268 items, while the MySQL limit was 151.
I think the best solution would be to fix the UI so it shows only the current user's artifacts, with pagination, instead of loading all artifacts at once. A rough sketch of the runtime change is below.
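A sketch of the runtime change, assuming the stock `mysql` deployment in the `kubeflow` namespace and an empty root password (otherwise pass the password from `mysql-secret`). Note that `SET GLOBAL` does not survive a pod restart; a persistent fix would pass `--max_connections=300` to mysqld in the deployment spec:

```sh
# Check the current limit (MySQL's default is 151).
kubectl -n kubeflow exec deploy/mysql -- \
  mysql -u root -e "SHOW VARIABLES LIKE 'max_connections';"

# Raise the limit at runtime; this is lost when the pod restarts.
kubectl -n kubeflow exec deploy/mysql -- \
  mysql -u root -e "SET GLOBAL max_connections = 300;"
```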
The same issue as in https://github.com/kubeflow/pipelines/issues/8844, but in our case I had to increase `max_connections` to 500, and that's definitely not a real solution, because the number of artifacts is constantly growing.
/close
There has been no activity for a long time. Please reopen if necessary.
@juliusvonkohout: Closing this issue.
In response to this:
> /close
> There has been no activity for a long time. Please reopen if necessary.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.