[sdk] enable_caching breaks when using CreatePVC: must specify FingerPrint

Open TobiasGoerke opened this issue 8 months ago • 13 comments

Environment

  • KFP version: 2.0.3 (manifests v1.8 release)
  • KFP SDK version:
kfp                      2.4.0
kfp-kubernetes           1.0.0
kfp-pipeline-spec        0.2.2
kfp-server-api           2.0.3
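
For anyone else reporting their environment, the version list above can be gathered programmatically. This is a minimal sketch using only the standard library's importlib.metadata; the helper name installed_versions is illustrative and not part of KFP.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the environment listing in this report.
KFP_PACKAGES = ["kfp", "kfp-kubernetes", "kfp-pipeline-spec", "kfp-server-api"]


def installed_versions(packages=KFP_PACKAGES):
    """Return {package: version string}, or "not installed" when absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:<24} {ver}")
```

Run it inside the same virtual environment that submits the pipeline, so the reported versions match the SDK actually in use.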

Steps to reproduce

Given the following example:

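import kfp
from kfp import dsl
from kfp import kubernetes


@dsl.component
def test_step():
    print("Hello world")


@dsl.pipeline
def test_pipeline():
    # Create a small PVC as the first pipeline step.
    kubernetes.CreatePVC(
        access_modes=["ReadWriteOnce"],
        size="10Mi",
        storage_class_name="default",
    )
    test_step()


# Client setup was not shown in the original report; pass host/credentials as your deployment requires.
client = kfp.Client()
# enable_caching=False is the setting that triggers the failure.
client.create_run_from_pipeline_func(test_pipeline, arguments={}, enable_caching=False)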

The pipeline fails. Note the enable_caching argument: the failure occurs when it is set to False.

The createpvc step reports the following error:

F1031 14:29:54.216337 27 main.go:76] KFP driver: driver.Container(pipelineName=test-pipeline, runID=02ad61d6-8b9b-47a7-b626-0d65f3838b42, task="createpvc", component="comp-createpvc", dagExecutionID=9094, componentSpec) failed: failed to create PVC and publish execution createpvc: failed to create cache entrty for create pvc: failed to create task: rpc error: code = InvalidArgument desc = Failed to create a new task due to validation error: Invalid input error: Invalid task: must specify FingerPrint
time="2023-10-31T14:29:54.940Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"
Error: exit status 1
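
The error text suggests that a task record reached the API server with an empty fingerprint. The sketch below is a hypothetical model of that failure mode, not the KFP driver or API server code. It assumes the fingerprint is computed only when caching is enabled, while the cache entry is created unconditionally. All names (Task, validate_task, create_pvc_cache_entry) are invented for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for the task record sent to the API server."""
    run_id: str
    fingerprint: str = ""


def validate_task(task: Task) -> None:
    # Mirrors the message in the log; the real validation lives server-side.
    if not task.fingerprint:
        raise ValueError("Invalid task: must specify FingerPrint")


def create_pvc_cache_entry(run_id: str, inputs: dict, enable_caching: bool) -> Task:
    # Assumed behaviour: the fingerprint is skipped when caching is off,
    # but the cache entry is still created, so validation rejects it.
    fingerprint = ""
    if enable_caching:
        payload = json.dumps(inputs, sort_keys=True).encode()
        fingerprint = hashlib.sha256(payload).hexdigest()
    task = Task(run_id=run_id, fingerprint=fingerprint)
    validate_task(task)
    return task


pvc_inputs = {"access_modes": ["ReadWriteOnce"], "size": "10Mi"}
cached = create_pvc_cache_entry("run-1", pvc_inputs, enable_caching=True)
try:
    create_pvc_cache_entry("run-2", pvc_inputs, enable_caching=False)
    uncached_error = None
except ValueError as err:
    uncached_error = str(err)
print(uncached_error)
```

If this reading is right, a fix would either skip the cache entry when caching is disabled or compute the fingerprint in both cases. That is an inference from the log, not a confirmed diagnosis.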

Impacted by this bug? Give it a 👍.

TobiasGoerke avatar Oct 31 '23 14:10 TobiasGoerke

@TobiasGoerke what is the version of your KFP runtime? There may be a bug in resolving the cache key in the PVC creation operation. cc @chensun to learn more.

zijianjoy avatar Nov 02 '23 22:11 zijianjoy

I'm on manifests/v1.8-branch, i.e. 2.0.3.

TobiasGoerke avatar Nov 03 '23 07:11 TobiasGoerke

I am facing exactly the same issue, with the same output, on KFP backend 2.0.3 with a Kubeflow 1.8.0 manifests deployment. The PVC is created, but the component reports the error shown in the logs below and exits with an error.

F1117 21:35:33.015147      22 main.go:76] KFP driver: driver.Container(pipelineName=my-pipeline, runID=cd147529-1b6c-454b-b3e1-b2858ff98222, task="createpvc", component="comp-createpvc", dagExecutionID=29, componentSpec) failed: failed to create PVC and publish execution createpvc: failed to create cache entrty for create pvc: failed to create task: rpc error: code = InvalidArgument desc = Failed to create a new task due to validation error: Invalid input error: Invalid task: must specify FingerPrint
time="2023-11-17T21:35:33.321Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"
Error: exit status 1

yingding avatar Nov 17 '23 21:11 yingding

Some additional info: after hitting this issue, the KFP backend stopped working in my case. I had to restart all deployments (kubectl -n kubeflow rollout restart deployments) before I could run v2 pipelines again.
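
The restart mentioned in this comment could be scripted as follows. This is a sketch that assumes kubectl is on PATH and configured for the affected cluster. The function names are illustrative, and the command itself is the one quoted in the comment.

```python
import subprocess


def rollout_restart_command(namespace: str = "kubeflow") -> list:
    """Build the kubectl command quoted in the comment above."""
    return ["kubectl", "-n", namespace, "rollout", "restart", "deployments"]


def restart_deployments(namespace: str = "kubeflow") -> None:
    # check=True raises CalledProcessError if kubectl exits non-zero.
    subprocess.run(rollout_restart_command(namespace), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so nothing restarts by accident.
    print(" ".join(rollout_restart_command()))
```

Call restart_deployments() explicitly only when a restart is actually intended, since it restarts every deployment in the namespace.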

yingding avatar Nov 19 '23 12:11 yingding

With api-server 2.0.5 and enable_caching=False, this issue still exists.

  • KFP Backend API-SERVER version: 2.0.5 (manifests v1.8 release modified)
  • KFP SDK version:
kfp                      2.4.0
kfp-kubernetes           1.0.0
kfp-pipeline-spec        0.2.2
kfp-server-api           2.0.5

yingding avatar Jan 03 '24 19:01 yingding

@yingding is it working now?

kabartay avatar Jan 29 '24 18:01 kabartay

@kabartay Unfortunately, this issue still exists, even with:

  • KFP Backend API-SERVER version: 2.0.5 (manifests v1.8 release modified)
  • KFP SDK version:
kfp                           2.6.0
kfp-kubernetes                1.1.0
kfp-pipeline-spec             0.3.0
kfp-server-api                2.0.5

Hopefully, it can be resolved in the next KFP backend API server release.

yingding avatar Jan 29 '24 22:01 yingding

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 30 '24 07:03 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Apr 21 '24 07:04 github-actions[bot]

/reopen

Seems this issue has not been resolved, yet.

AnnKatrinBecker avatar May 14 '24 07:05 AnnKatrinBecker

@AnnKatrinBecker: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Seems this issue has not been resolved, yet.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

google-oss-prow[bot] avatar May 14 '24 07:05 google-oss-prow[bot]

/reopen

HumairAK avatar May 14 '24 14:05 HumairAK

@HumairAK: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

google-oss-prow[bot] avatar May 14 '24 14:05 google-oss-prow[bot]