
[backend] Cannot list artifacts

pablofiumara opened this issue

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?

Using https://www.kubeflow.org/docs/distributions/gke/deploy/upgrade/

  • KFP version: ml-pipeline/frontend:1.8.1, ml-pipeline/api-server:1.8.1

Steps to reproduce

Upgrading from Kubeflow 1.3 to Kubeflow 1.5 reproduces the problem.

Expected result

I expect to be able to see a list of artifacts when I access myClusterURL/pipeline/artifacts. Instead I get this error (screenshot): https://user-images.githubusercontent.com/74205824/186285977-cba538c2-e496-416e-8f27-67fa4950b4cc.png

Materials and Reference


Impacted by this bug? Give it a 👍.

pablofiumara avatar Aug 24 '22 19:08 pablofiumara

Have you checked that both the ml-pipeline-ui deployment in the kubeflow namespace and the ml-pipeline-ui-artifact deployment in user namespaces are using ml-pipeline/frontend:1.8.1?
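
For example, assuming the standard deployment names (replace <user-namespace> with your profile namespace), you can compare the two images directly:

# Image used by the main UI deployment
kubectl get deployment ml-pipeline-ui -n kubeflow -o jsonpath='{.spec.template.spec.containers[0].image}'
# Image used by the per-namespace artifact proxy
kubectl get deployment ml-pipeline-ui-artifact -n <user-namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'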

zijianjoy avatar Aug 25 '22 22:08 zijianjoy

@zijianjoy Yes, I have:

Name:                   ml-pipeline-ui
Namespace:              kubeflow
CreationTimestamp:      Wed, 23 Jun 2021 21:52:54 -0300
Labels:                 app=ml-pipeline-ui
                        app.kubernetes.io/component=ml-pipeline
                        app.kubernetes.io/name=kubeflow-pipelines
Annotations:            deployment.kubernetes.io/revision: 21
Selector:               app=ml-pipeline-ui,app.kubernetes.io/component=ml-pipeline,app.kubernetes.io/name=kubeflow-pipelines
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=ml-pipeline-ui
                    app.kubernetes.io/component=ml-pipeline
                    app.kubernetes.io/name=kubeflow-pipelines
  Annotations:      cluster-autoscaler.kubernetes.io/safe-to-evict: true
                    kubectl.kubernetes.io/restartedAt: 2022-08-25T18:19:01-03:00
  Service Account:  ml-pipeline-ui
  Containers:
   ml-pipeline-ui:
    Image:      gcr.io/ml-pipeline/frontend:1.8.1
    Port:       3000/TCP
    Host Port:  0/TCP
    Requests:
      cpu:      10m
      memory:   70Mi
    Liveness:   exec [wget -q -S -O - http://localhost:3000/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:  exec [wget -q -S -O - http://localhost:3000/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment:
      KUBEFLOW_USERID_HEADER:                     <set to the key 'userid-header' of config map 'kubeflow-config'>  Optional: false
      KUBEFLOW_USERID_PREFIX:                     <set to the key 'userid-prefix' of config map 'kubeflow-config'>  Optional: false
      VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH:  /etc/config/viewer-pod-template.json
      DEPLOYMENT:                                 KUBEFLOW
      ARTIFACTS_SERVICE_PROXY_NAME:               ml-pipeline-ui-artifact
      ARTIFACTS_SERVICE_PROXY_PORT:               80
      ARTIFACTS_SERVICE_PROXY_ENABLED:            true
      ENABLE_AUTHZ:                               true
      MINIO_NAMESPACE:                             (v1:metadata.namespace)
      MINIO_ACCESS_KEY:                           <set to the key 'accesskey' in secret 'mlpipeline-minio-artifact'>  Optional: false
      MINIO_SECRET_KEY:                           <set to the key 'secretkey' in secret 'mlpipeline-minio-artifact'>  Optional: false
      ALLOW_CUSTOM_VISUALIZATIONS:                true
    Mounts:
      /etc/config from config-volume (ro)
  Volumes:
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ml-pipeline-ui-configmap
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   ml-pipeline-ui-oneId (1/1 replicas created)
Events:          <none>

Name:                   ml-pipeline-ui-artifact
Namespace:              myNamespace
CreationTimestamp:      Mon, 13 Jun 2022 17:20:27 -0300
Labels:                 app=ml-pipeline-ui-artifact
                        controller-uid=34641e66-4d49-4025-b235-fc433a8e2049
Annotations:            deployment.kubernetes.io/revision: 4
                        metacontroller.k8s.io/last-applied-configuration:
                          {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"labels":{"app":"ml-pipeline-ui-artifact","controller-uid":"34641e66-4d49-4025-b23...
Selector:               app=ml-pipeline-ui-artifact
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=ml-pipeline-ui-artifact
  Annotations:      kubectl.kubernetes.io/restartedAt: 2022-08-23T18:23:11-03:00
  Service Account:  default-editor
  Containers:
   ml-pipeline-ui-artifact:
    Image:      gcr.io/ml-pipeline/frontend:1.8.1
    Port:       3000/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     100m
      memory:  500Mi
    Requests:
      cpu:     10m
      memory:  70Mi
    Environment:
      MINIO_ACCESS_KEY:  <set to the key 'accesskey' in secret 'mlpipeline-minio-artifact'>  Optional: false
      MINIO_SECRET_KEY:  <set to the key 'secretkey' in secret 'mlpipeline-minio-artifact'>  Optional: false
    Mounts:              <none>
  Volumes:               <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   ml-pipeline-ui-artifact-bb5bc4b57 (1/1 replicas created)
Events:          <none>

What else can I check?

pablofiumara avatar Aug 25 '22 23:08 pablofiumara

If I go to myCluster/ml_metadata.MetadataStoreService/GetEventsByArtifactIDs, I get the message

upstream connect error or disconnect/reset before headers. reset reason: remote reset

Using asm-1143-0
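
One way to rule out the mesh is to port-forward straight to the MLMD gRPC service and call it locally — a sketch, assuming the standard metadata-grpc-service Service name and that grpcurl is installed (grpcurl needs the server to expose gRPC reflection, or local .proto files):

# Bypass Istio/ASM and talk to the MLMD server directly
kubectl port-forward svc/metadata-grpc-service -n kubeflow 8080:8080
# In another shell: list services, if reflection is enabled
grpcurl -plaintext localhost:8080 list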

pablofiumara avatar Aug 26 '22 00:08 pablofiumara

ml-metadata was upgraded from 1.0.0 to 1.5.0 when Kubeflow moved from 1.3 to 1.5: https://github.com/kubeflow/pipelines/commits/master/third_party/ml-metadata

As a result, the MLMD schema version changed, so you need to follow the instructions to upgrade the MLMD dependency: https://github.com/google/ml-metadata/blob/master/g3doc/get_started.md#upgrade-the-mlmd-library
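
Concretely, per those instructions the gRPC server has to be started with --enable_database_upgrade=true so it migrates the schema on startup. A sketch of one way to add the flag, assuming the standard deployment name and container layout (a JSON patch appending to the args array):

kubectl patch deployment metadata-grpc-deployment -n kubeflow --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--enable_database_upgrade=true"}]'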

zijianjoy avatar Aug 26 '22 03:08 zijianjoy

@zijianjoy Thank you very much for your answer. If I execute

kubectl describe deployment metadata-grpc-deployment -n kubeflow

I get


Name:                   metadata-grpc-deployment
Namespace:              kubeflow
CreationTimestamp:      Wed, 23 Jun 2021 21:52:53 -0300
Labels:                 component=metadata-grpc-server
Annotations:            deployment.kubernetes.io/revision: 27
Selector:               component=metadata-grpc-server
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           component=metadata-grpc-server
  Annotations:      kubectl.kubernetes.io/restartedAt: 2022-08-26T16:44:45-03:00
  Service Account:  metadata-grpc-server
  Containers:
   container:
    Image:      gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /bin/metadata_store_server
    Args:
      --grpc_port=8080
      --mysql_config_database=$(MYSQL_DATABASE)
      --mysql_config_host=$(MYSQL_HOST)
      --mysql_config_port=$(MYSQL_PORT)
      --mysql_config_user=$(DBCONFIG_USER)
      --mysql_config_password=$(DBCONFIG_PASSWORD)
      --enable_database_upgrade=true
    Liveness:   tcp-socket :grpc-api delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:  tcp-socket :grpc-api delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment:
      DBCONFIG_USER:      <set to the key 'username' in secret 'mysql-secret'>               Optional: false
      DBCONFIG_PASSWORD:  <set to the key 'password' in secret 'mysql-secret'>               Optional: false
      MYSQL_DATABASE:     <set to the key 'mlmdDb' of config map 'pipeline-install-config'>  Optional: false
      MYSQL_HOST:         <set to the key 'dbHost' of config map 'pipeline-install-config'>  Optional: false
      MYSQL_PORT:         <set to the key 'dbPort' of config map 'pipeline-install-config'>  Optional: false
    Mounts:               <none>
  Volumes:                <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   metadata-grpc-deployment-56779cf65 (1/1 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  50m    deployment-controller  Scaled up replica set metadata-grpc-deployment-bb6856f48 to 1
  Normal  ScalingReplicaSet  48m    deployment-controller  Scaled down replica set metadata-grpc-deployment-58c7dbcd8b to 0
  Normal  ScalingReplicaSet  39m    deployment-controller  Scaled up replica set metadata-grpc-deployment-6cc4b76c8d to 1
  Normal  ScalingReplicaSet  38m    deployment-controller  Scaled down replica set metadata-grpc-deployment-bb6856f48 to 0
  Normal  ScalingReplicaSet  36m    deployment-controller  Scaled up replica set metadata-grpc-deployment-8c74d44b5 to 1
  Normal  ScalingReplicaSet  35m    deployment-controller  Scaled down replica set metadata-grpc-deployment-6cc4b76c8d to 0
  Normal  ScalingReplicaSet  2m53s  deployment-controller  Scaled up replica set metadata-grpc-deployment-56779cf65 to 1
  Normal  ScalingReplicaSet  2m19s  deployment-controller  Scaled down replica set metadata-grpc-deployment-8c74d44b5 to 0

Does this mean the MLMD dependency version is correct? What am I missing?

pablofiumara avatar Aug 26 '22 19:08 pablofiumara

You need to upgrade the MLMD database schema: https://github.com/google/ml-metadata/blob/master/g3doc/get_started.md#upgrade-the-database-schema
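
You can also verify what schema version the database is actually at: MLMD records it in the MLMDEnv table. A sketch, assuming the default mysql deployment in the kubeflow namespace, a passwordless root user, and the default metadb database (adjust to match your pipeline-install-config):

# Read the MLMD schema version straight from MySQL
kubectl exec -n kubeflow deploy/mysql -- mysql -u root -e 'SELECT schema_version FROM metadb.MLMDEnv;'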

zijianjoy avatar Aug 28 '22 05:08 zijianjoy

There is a tool for MLMD upgrade: https://github.com/kubeflow/pipelines/blob/74c7773ca40decfd0d4ed40dc93a6af591bbc190/tools/metadatastore-upgrade/README.md

zijianjoy avatar Sep 06 '22 21:09 zijianjoy

Hi @zijianjoy, our cluster is a freshly installed Kubeflow 1.5.0 cluster.

We also see the error page below when accessing myClusterURL/pipeline/artifacts.

At first the artifacts page loaded successfully, but after we ran about 600 recurring runs it failed to load with the above message.

Even after we removed all the content under the mlpipeline/artifacts/ path in MinIO, the artifacts page still failed to load with the same error.

Is there any way to recover? Thanks!

celiawa avatar Nov 02 '22 05:11 celiawa

@celiawa Currently the page lists all artifacts from the MLMD store. Even if you delete the content in MinIO, the corresponding MLMD objects are not deleted. The failure is most likely a timeout while trying to list all the artifacts. There is a plan to improve this page: https://github.com/kubeflow/pipelines/issues/3226
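
If you want to gauge how much the page is trying to load, you can count the MLMD artifact rows directly — a sketch, assuming the default mysql deployment in the kubeflow namespace, a passwordless root user, and the default metadb database:

# Number of artifact records the UI tries to list in one request
kubectl exec -n kubeflow deploy/mysql -- mysql -u root -e 'SELECT COUNT(*) FROM metadb.Artifact;'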

zijianjoy avatar Nov 02 '22 19:11 zijianjoy

Thanks @zijianjoy. I checked the MySQL database backing the MLMD store; there are many tables in it. Which tables should we delete to get our artifacts page back? We don't want to reinstall.

celiawa avatar Nov 03 '22 08:11 celiawa

Hi @zijianjoy @celiawa, I am also facing the same issue and am unable to see the artifacts in Kubeflow. Please let me know how to fix it.

subasathees avatar Sep 12 '23 12:09 subasathees

Upgrading KFP to the latest version should give you a paginated artifact list now.

zijianjoy avatar Sep 12 '23 15:09 zijianjoy

Thanks @zijianjoy, we upgraded to KFP version 2.0.1 and can see the paginated artifact list now.

celiawa avatar Sep 13 '23 08:09 celiawa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 03 '24 07:03 github-actions[bot]

Closing this issue as it seems to be solved.

/close

rimolive avatar Mar 12 '24 07:03 rimolive

@rimolive: Closing this issue.

In response to this:

Closing this issue as it seems to be solved.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Mar 12 '24 07:03 google-oss-prow[bot]