
KF 1.5 metadata-grpc-deployment "Failed to connect to the database: mysql_real_connect failed"


/kind bug

What steps did you take and what happened: I can't see the Artifacts page in the KF 1.5 web app from my KF 1.5.1 manifests installation; it fails with an ArtifactError.

What did you expect to happen: I should be able to see the Artifacts without an error.

Anything else you would like to add:

  1. With kubectl -n kubeflow describe deployment metadata-grpc-deployment, the container image is:
     Image:      gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
  2. The metadata-grpc-deployment pod shows errors in its log; the output of kubectl -n kubeflow logs -l component=metadata-grpc-server is:
I0718 11:21:46.510772     1 metadata_store_server_main.cc:258] Server listening on 0.0.0.0:8080
W0718 12:30:54.066205   173 metadata_store_service_impl.cc:432] Failed to connect to the database: mysql_real_connect failed: errno: , error:
W0718 12:30:54.066272   171 metadata_store_service_impl.cc:432] Failed to connect to the database: mysql_real_connect failed: errno: , error:
  3. I was able to log in to the mysql pod with kubectl -n kubeflow exec -it mysql-b746975b5-lt2n9 -- /bin/bash, and also to log in to metadb with the default user root and the password from the secret mysql-secret:
mysql -D metadb -u root -p

mysql> use metadb
mysql> show tables;
mysql> quit

The mysql service also seems to be fine.

  4. I also tried to re-deploy mysql and metadata-grpc-deployment, and it doesn't help either (see the diagnostic sketch below the commands):
kubectl rollout restart deployment mysql -n kubeflow
kubectl rollout restart deployment metadata-grpc-deployment -n kubeflow
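
To double-check how the gRPC server is pointed at MySQL, the deployment's container args and the mysql service can be inspected. A minimal sketch (the exact flag names vary between ml_metadata versions):

# Show the flags the metadata store server is started with
kubectl -n kubeflow get deployment metadata-grpc-deployment \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
# Confirm the mysql service exists and has endpoints
kubectl -n kubeflow get svc mysql -o wide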

My other on-prem KF 1.4.0 manifests installation, which uses gcr.io/tfx-oss-public/ml_metadata_store_server:1.0.0, doesn't seem to have this issue, so the problem might be related to the image gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0.

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard): KF manifests 1.5.1
  • kfctl version: (use kfctl version): none
  • Kubernetes platform: (e.g. minikube) microk8s 3 nodes cluster
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-14T02:31:37Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-3+6937f71915b56b", GitCommit:"6937f71915b56b6004162b7c7b3f11f196100d44", GitTreeState:"clean", BuildDate:"2022-04-28T11:11:24Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="20.04.3 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 20.04.3 LTS" VERSION_ID="20.04"

yingding avatar Jul 18 '22 13:07 yingding

I have the same issue with Kubeflow 1.6.0. There is no issue connecting to mysql from a shell on the metadata-grpc-deployment container, but the connection attempts from metadata_store_service_impl.cc are failing.
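
For reference, a sketch of that kind of check (whether a mysql client and bash are available inside the image is an assumption; substitute sh or a plain TCP test if they are missing):

# Shell into the metadata gRPC container and try the same connection the server makes
kubectl -n kubeflow exec -it deploy/metadata-grpc-deployment -- /bin/bash
mysql -h mysql -P 3306 -u root -p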

mwoodbri avatar Sep 09 '22 08:09 mwoodbri

Hi, I am also seeing the same issue on KF 1.5.0, though still with gcr.io/tfx-oss-public/ml_metadata_store_server:1.0.0. So perhaps it has to do with some other part of the manifests between 1.4 and 1.5.

imiller445 avatar Sep 09 '22 16:09 imiller445

Same for me, seeing this in v1.6.

dbg-raghulkrishna avatar Sep 14 '22 18:09 dbg-raghulkrishna

I still experience this after upgrading to the KF v1.6.1 manifests. Strangely, the URL /pipeline/?ns=kubeflow-1#/artifacts shows the error, but when I navigate from the Executions tab to the URL /pipeline/?ns=kubeflow-1#/artifacts/1, it works.

I can access the Overview and the Lineage Explorer for a particular artifact. Maybe the query for the main artifacts page is broken for multi-tenancy?

yingding avatar Oct 11 '22 18:10 yingding

+1

dbg-raghulkrishna avatar Oct 11 '22 18:10 dbg-raghulkrishna

Also seeing this here with 1.6.1 - an interesting thing though is that if I do a deployment from scratch the error doesn't seem to trigger until some amount of time has passed. Has anyone else experienced this?

nkosteski avatar Oct 14 '22 19:10 nkosteski

> Also seeing this here with 1.6.1 - an interesting thing though is that if I do a deployment from scratch the error doesn't seem to trigger until some amount of time has passed. Has anyone else experienced this?

I am experiencing this.

dbg-raghulkrishna avatar Oct 16 '22 21:10 dbg-raghulkrishna

+1

RakeshRaj97 avatar Nov 23 '22 23:11 RakeshRaj97

+1

akravacyber avatar Jan 17 '23 22:01 akravacyber

Hey, a little update on what I found. It seems this might actually be caused by Istio, at least for my deployment on Azure AKS (it might have something to do with the kubeflow namespace being labeled as a control plane). To handle the problem at least temporarily while a better solution is worked out, I set the DestinationRules' tls mode from ISTIO_MUTUAL to DISABLE. I wouldn't necessarily recommend this change in all cases, but in our case we decided it was fine.

For further detail: it seems the Istio certificate issuer can't renew the certificates that are pushed out for the ISTIO_MUTUAL tls mode, for whatever reason, which is why the services stop working after some time. Hope this helps, and if I ever have time to come up with a better solution than turning tls off, I'll try to remember to come back to this.
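
A minimal sketch of that workaround, assuming a DestinationRule named ml-pipeline as in the stock manifests (list the rules first; the names in your cluster may differ):

# List the DestinationRules in the kubeflow namespace
kubectl -n kubeflow get destinationrules

# Switch one rule's TLS mode from ISTIO_MUTUAL to DISABLE ("ml-pipeline" is an assumed name)
kubectl -n kubeflow patch destinationrule ml-pipeline --type merge \
  -p '{"spec":{"trafficPolicy":{"tls":{"mode":"DISABLE"}}}}'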

nkosteski avatar Jan 18 '23 13:01 nkosteski

I've resolved this issue in a different way: I increased the max_connections parameter for MySQL from 151 (the default) to 300. The Kubeflow Artifacts page tries to load all artifacts from all users, and I had 268 items while the MySQL limit was 151.

I think the best solution is to fix the UI so it shows only the current user's artifacts and paginates them, instead of loading everything at once.
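
To check the current limit and raise it at runtime, a minimal sketch (SET GLOBAL does not survive a pod restart, so the value also needs to be persisted, e.g. via a server startup flag):

mysql> SHOW VARIABLES LIKE 'max_connections';  -- current limit (151 by default)
mysql> SHOW STATUS LIKE 'Threads_connected';   -- connections currently in use
mysql> SET GLOBAL max_connections = 300;       -- raises the limit until the next restart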

akravacyber avatar Jan 18 '23 18:01 akravacyber

The same issue is reported in https://github.com/kubeflow/pipelines/issues/8844, but in our case I had to increase max_connections to 500, and it's definitely not a solution, because the number of artifacts is constantly growing.
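
To make the raised limit survive pod restarts, one option is to pass it as a startup flag on the mysql Deployment. A sketch, assuming the container already defines an args array (the JSON patch appends to it and fails otherwise):

# Append --max_connections to the mysql container's startup args
kubectl -n kubeflow patch deployment mysql --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--max_connections=500"}]'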

karmicdude avatar Feb 24 '23 07:02 karmicdude

/close

There has been no activity for a long time. Please reopen if necessary.

juliusvonkohout avatar Aug 25 '23 10:08 juliusvonkohout

@juliusvonkohout: Closing this issue.

In response to this:

/close

There has been no activity for a long time. Please reopen if necessary.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Aug 25 '23 10:08 google-oss-prow[bot]