
WIP: feat(backend): implement setting ttl on pipelines. Fixes: #10899


Description of your changes: Fixes: #10899

This is a WIP PR that I'm posting mostly to get help from KFP maintainers who can guide me further.

In KFP v1, there was a function kfp.dsl.PipelineConf().set_ttl_seconds_after_finished(seconds) that garbage collected completed pods after a specified time. However, in KFP v2, this function has been deprecated. The entire PipelineConf is deprecated, actually, and I don't see a replacement.
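For reference, the v1 usage looked roughly like this (a sketch against the v1 SDK; the pipeline itself is just a placeholder):

import kfp
from kfp import dsl

@dsl.pipeline(name='my-pipeline')
def my_pipeline():
    ...

# KFP v1: garbage-collect the run's pods 120 seconds after the run finishes
conf = dsl.PipelineConf()
conf.set_ttl_seconds_after_finished(120)

kfp.compiler.Compiler().compile(
    my_pipeline, 'pipeline.yaml', pipeline_conf=conf)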

I had some success with editing the pipeline proto to add a pipeline-scope ttl config option, and then adding support for that in the argo compiler. I am guessing that the platform config is the right place to put it (need guidance, please).

In this WIP, when I add the following to the end of a pipeline, the Workflow (and its pods) are deleted after 2 minutes:

---
platforms:
  kubernetes:
    deploymentSpec:
      completedPipelineTtl: 120

(completedPipelineTtl is the new field I added in the proto. Previously there was only a map of configs for the tasks, so it used to look like this:

---
platforms:
  kubernetes:
    deploymentSpec:
      executors:
        exec-comp-1:
            <snip>
        exec-comp-2:
            <snip>
        ...

)
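Something like this reproduces the manual edit described above (a sketch; the component, pipeline name, and file path are placeholders):

from kfp import compiler, dsl
import yaml

@dsl.component
def comp():
    pass

@dsl.pipeline
def my_ttl_pipeline():
    comp()

compiler.Compiler().compile(my_ttl_pipeline, 'pipeline.yaml')

# Append the platform-config document carrying the new (WIP) completedPipelineTtl
# field to the end of the compiled IR YAML.
platform_doc = {
    'platforms': {
        'kubernetes': {
            'deploymentSpec': {
                'completedPipelineTtl': 120,
            },
        },
    },
}
with open('pipeline.yaml', 'a') as f:
    f.write('\n---\n')
    f.write(yaml.safe_dump(platform_doc))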

Now the hard part: I'm not sure how to proceed with changing the SDK/DSL to get it to output that pipeline spec.

It seems like using kfp-kubernetes is the correct way to go because this is a platform config option we want to set. However, there are no examples of doing pipeline-scope config -- all of the current kfp-kubernetes functions work on tasks, not pipelines. I poked at using something like this

https://github.com/kubeflow/pipelines/blob/bf5104fcff6a6c2db8d8e39522c04eca1bb3fc93/kubernetes_platform/python/kfp/kubernetes/common.py#L19-L23

but for Pipelines, with the hopes of it looking like this in the DSL:

@dsl.pipeline
def my_ttl_pipelines():
    task = comp()
    kubernetes.set_completed_pipeline_ttl(
        pipeline, completed_pipeline_ttl=120)   # delete pods after 2 minutes

But, the Pipeline class is pretty barren and I don't see a way to get at platform_config from it.
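For contrast, the task-scoped pattern that kfp-kubernetes already supports looks like this (real helpers; the secret name and node selector values are just examples):

from kfp import dsl, kubernetes

@dsl.component
def comp():
    pass

@dsl.pipeline
def my_pipeline():
    task = comp()
    # every existing helper takes a task, not the pipeline
    kubernetes.use_secret_as_env(
        task,
        secret_name='my-secret',
        secret_key_to_env={'password': 'SECRET_VAR'})
    kubernetes.add_node_selector(
        task, label_key='disktype', label_value='ssd')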

I think I'm blocked until I get some pointers from folks much more familiar with the DSL and kfp-kubernetes.

@connor-mccarthy @chensun please help or suggest a DSL expert who can.

gregsheremeta avatar Jun 21 '24 19:06 gregsheremeta

Hi @gregsheremeta. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

google-oss-prow[bot] avatar Jun 21 '24 19:06 google-oss-prow[bot]

Hi, I need some pointers from folks much more familiar with the DSL and kfp-kubernetes and protobuf. See my initial PR message.

@connor-mccarthy @chensun @james-jwu @zijianjoy please help or suggest a DSL / protobuf / kfp-kubernetes expert who can assist. Thanks.

gregsheremeta avatar Jun 25 '24 22:06 gregsheremeta

This should not go in the k8s platform spec. The name may give that impression, but when we look at its proto file here, we can see that its primary use case is to provide an SDK API for manipulating k8s-native objects (e.g. pod spec, PVCs, secrets, configmaps, etc.).

The TTL should be a pipeline abstraction; the k8s-specific implementation should be handled compiler-side, which you have already done. I think a better location would be to add this to the deployment_spec, which seems to be intended for pipeline-level configurations.

HumairAK avatar Jun 26 '24 18:06 HumairAK

Hm, ok, that line of thinking makes sense. Let me try that. Thanks!

gregsheremeta avatar Jun 26 '24 18:06 gregsheremeta

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot] avatar Jul 01 '24 18:07 google-oss-prow[bot]

Note: This is functioning end to end. This is currently WIP only because I'd like to get an ack from a maintainer on the approach, at which point I'll split the proto changes into a separate PR, and then rebase this on top of that once it's merged. I'll then also add tests.

@chensun / @connor-mccarthy could we please get an ack on this approach, namely the proto addition here, so we can proceed with the above?

HumairAK avatar Jul 02 '24 19:07 HumairAK

/ok-to-test

hbelmiro avatar Jul 03 '24 12:07 hbelmiro

New changes are detected. LGTM label has been removed.

google-oss-prow[bot] avatar Jul 04 '24 00:07 google-oss-prow[bot]

@gregsheremeta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
kubeflow-pipeline-upgrade-test afa1286f4de3950509bc8d1c3b319773f7286f94 link false /test kubeflow-pipeline-upgrade-test
kfp-kubernetes-test-python310 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kfp-kubernetes-test-python310
kfp-kubernetes-execution-tests afa1286f4de3950509bc8d1c3b319773f7286f94 link false /test kfp-kubernetes-execution-tests
kubeflow-pipelines-sdk-docformatter afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-docformatter
kubeflow-pipelines-sdk-isort afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-isort
kubeflow-pipelines-sdk-yapf afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-yapf
kubeflow-pipelines-samples-v2 afa1286f4de3950509bc8d1c3b319773f7286f94 link false /test kubeflow-pipelines-samples-v2
kubeflow-pipelines-sdk-execution-tests afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-execution-tests
kfp-kubernetes-test-python311 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kfp-kubernetes-test-python311
kfp-kubernetes-test-python38 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kfp-kubernetes-test-python38
kfp-kubernetes-test-python39 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kfp-kubernetes-test-python39
kfp-kubernetes-test-python312 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kfp-kubernetes-test-python312
kubeflow-pipelines-sdk-python310 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-python310
kubeflow-pipelines-sdk-python38 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-python38
kubeflow-pipelines-sdk-python39 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-python39
kubeflow-pipelines-sdk-python311 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-python311
kubeflow-pipelines-sdk-python312 afa1286f4de3950509bc8d1c3b319773f7286f94 link true /test kubeflow-pipelines-sdk-python312

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

google-oss-prow[bot] avatar Jul 04 '24 00:07 google-oss-prow[bot]

To summarize my understanding of @zijianjoy's comments from the July 3rd KFP call (last topic discussed):

There are various levels where we could implement garbage collection of underlying pods:

  1. At the Deployment level (i.e. when deploying kfp, set a ttl option for all pods)
  2. At the Pipeline level (i.e. when uploading a pipeline, and/or for any runs created for a pipeline)
  3. At the Pipeline Run level (set ttl for individual Pipeline Runs)
  4. At the task level

In addition, (2), (3), and (4) could be exposed via the SDK or the UI, or some combination thereof.

What exists today is (4) via the SDK.

This PR allows users to do (2) via the SDK, since it is very annoying for users to set this on every task.

The implementation is extensible enough that we could easily add the capability to do (3)/(2) as well, though this would likely require additional DB columns for pipelines (for detecting the TTL, then simply checking for it during compilation). For (4) we could simply have a config option (also checked during compilation).

I think this addition is simple enough that we need not worry about the growing complexity of adding support for these varying levels in the future.

@zijianjoy / @chensun please let us know your thoughts on the above

HumairAK avatar Jul 04 '24 14:07 HumairAK

I think TTL after completion at the pod level makes the most sense. This can be configured at deployment time. And if we want to override the deployment setting for individual pipelines/tasks, setting it via the kfp-kubernetes extension and explicitly calling out that it's a TTL for pods would be my preference.

My rationale is that I think TTL after completion at the pipeline (and pipeline run) level is kind of misleading. We're not cleaning up pipeline run records from the database; post-TTL runs still show up when listing runs via the UI or SDK. The only thing that gets garbage collected is their pods (or maybe some Argo workflow record in addition to the pods).

chensun avatar Jul 31 '24 07:07 chensun

Thanks @chensun, great points! I agree that completed_pipeline_ttl here is definitely misleading.

setting it via the kfp-kubernetes extension and explicitly calling out that it's a TTL for pods would be my preference.

However, while the end result is pod cleanup, the implementation uses the backing engine's TTL API. I.e., it is not directly interacting with k8s-native resources, but rather going through the engine's CRD API. Based on the kubernetes-extension proto, it seems to me to be highly geared towards Kubernetes-native components at the task pod level. Adding a field that configures the backing pipeline engine's TTL strategy doesn't seem to fit in with the rest of its API, IMO.

As you mentioned, in Argo's case the Workflow resource is also cleaned up, so something like pod_ttl wouldn't be fully accurate either.

I think @gregsheremeta plans to explore the kubernetes-extension approach again, but maybe we can use a different name for the API field so that its intent is clearer, regardless of whether it ends up in kubernetes_platform or pipeline_spec.

While we look into this, it would be useful to know how strongly you feel about this change going into pipeline_spec (with a less misleading field name) vs. kubernetes_platform.

HumairAK avatar Jul 31 '24 22:07 HumairAK

I've extracted the PipelineConfig addition to its own PR, #11112.

Regarding the TTL portion of this PR, I don't see anything in Argo Workflows or Kubernetes that allows setting TTLs or self-destructs on individual pods. For pod-level TTL, I think we would have to implement a custom controller, or find an existing one that we could reuse.

Alternatively, we can stick with pipeline-level TTL, but we'd have to clean up database rows (runs, maybe other stuff?) in addition to setting the Workflow TTL field. This option sounds preferable to me.
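For the record, the Workflow TTL field I mean is Argo's spec.ttlStrategy. Patching it onto a Workflow manifest looks roughly like this (a sketch; the file path is a placeholder):

import yaml

# Argo Workflows exposes TTL at the Workflow (not pod) level via spec.ttlStrategy.
with open('workflow.yaml') as f:
    wf = yaml.safe_load(f)

# Delete the Workflow (and, via owner references, its pods) 2 minutes after it finishes.
wf.setdefault('spec', {})['ttlStrategy'] = {'secondsAfterCompletion': 120}

with open('workflow.yaml', 'w') as f:
    yaml.safe_dump(wf, f)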

Closing this PR while I research the above and wait for #11112 to land.

gregsheremeta avatar Aug 17 '24 17:08 gregsheremeta

I think TTL after completion at the pod level makes the most sense. This can be configured at deployment time. And if we want to override the deployment setting for individual pipelines/tasks, setting it via the kfp-kubernetes extension and explicitly calling out that it's a TTL for pods would be my preference.

My rationale is that I think TTL after completion at the pipeline (and pipeline run) level is kind of misleading. We're not cleaning up pipeline run records from the database; post-TTL runs still show up when listing runs via the UI or SDK. The only thing that gets garbage collected is their pods (or maybe some Argo workflow record in addition to the pods).

Maybe we can disambiguate between etcd TTL (for runs) and KFP database TTL (for runs). I think it's important to be able to configure both separately.

We use the KFP database for long term storage, reproducibility, and model governance. As a result, we need either long-term or indefinite TTL in the KFP database.

In contrast, when it comes to etcd, a reduced TTL is advisable. A large volume of non-GCed Argo Workflow custom resources can lead to K8s API latency and can even crash a cluster. Don't ask me how I know. Okay, fine: we were testing a scheduled run that created a new run / workflow every 10 seconds, and there was no TTL.

etcd / Workflow CR TTL was configurable in v1. IMHO, that gap should be closed in v2, and is probably a smaller, self-contained issue that can be decoupled from database TTL and pod-specific TTL (both of which are meaningful issues in their own right ofc).

Side note: something to be aware of with pod TTL is that if Workflow TTL is less than pod TTL, the pod will be GCed earlier than anticipated since the pods have an owner reference to the Workflow CR, i.e. deletions cascade.

Also, FWIW, the people concerned about database cleanup are usually platform admins, not end users. As a result, end users will likely not really care about database TTL. With that in mind, it might make more sense to handle database cleanup through an external process that platform admins can control.

Apologies if any of the above is stating the obvious!

droctothorpe avatar Dec 04 '24 18:12 droctothorpe

Maybe we can disambiguate between etcd TTL (for runs) and KFP database TTL (for runs). I think it's important to be able to configure both separately.

This was in the back of my mind, but I didn't think it was important :) Happy to hear a different perspective, though.

That said, I strongly dislike when we leak implementation details into the KFP SDK. I already dislike that we have to leak setting a TTL into the SDK. I'll dislike it even more if we make it even leakier by going from set_ttl() to set_database_ttl() and set_argo_workflow_ttl() :vomiting_face:

gregsheremeta avatar Dec 05 '24 12:12 gregsheremeta

Design Doc:

Kubeflow Pipelines - Design - Implement TTL for pipeline runs - Google Docs https://docs.google.com/document/d/1Aakg7udmM_YLEU7gqGi9PbowTZDscy_zzVRx0CcAXeQ/edit

gregsheremeta avatar Dec 05 '24 12:12 gregsheremeta

Will continue the convo in the design doc.

droctothorpe avatar Dec 05 '24 14:12 droctothorpe

I'll dislike it even more if we make it even leakier by going from set_ttl() to set_database_ttl() and set_argo_workflow_ttl()

I don't think DB cleanup should be something to set in the SDK, but rather an env var on the Persistence Agent. My idea is a flag to keep run history in the database, with a default value of true; admins who don't care much about persisting run history in the DB can just change it to false.

rimolive avatar Dec 05 '24 14:12 rimolive