pipelines icon indicating copy to clipboard operation
pipelines copied to clipboard

feat(Backend + SDK): Update kfp backend and kubernetes sdk to support EmptyDir

Open gregsheremeta opened this issue 1 year ago • 10 comments

Description of your changes: Update kfp backend and kubernetes sdk to support mounting EmptyDir volumes to task pods.

(Based on #10892 and needs rebase after that is merged first)

Inspired by #10427

Fixes: #10656

Checklist:

gregsheremeta avatar Jun 17 '24 00:06 gregsheremeta

Hi @gregsheremeta. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

google-oss-prow[bot] avatar Jun 17 '24 00:06 google-oss-prow[bot]

/ok-to-test

hbelmiro avatar Jun 17 '24 11:06 hbelmiro

Thanks for tackling this! It's an important gap between v1 and v2.

droctothorpe avatar Jun 17 '24 14:06 droctothorpe

@chensun can I ask for your review, or can you recommend someone to review? thanks.

gregsheremeta avatar Jun 18 '24 13:06 gregsheremeta

@gregsheremeta I can help with the backend review, but for SDK we have only @connor-mccarthy. I can review both and see how can we proceed with approvals.

rimolive avatar Jun 18 '24 14:06 rimolive

thanks. For what it's worth, the SDK piece is fairly simple, and Chen gave a preliminary lgtm over in #10892. It follows the (already approved and merged) example set in #10410. The key thing to review there is that I included the important fields. When you are doing an emptydir mount in kube, there are four pieces of information required (ref: https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/volume/#local-temporary-directory) : name, mount path, medium (optional), and size limit (optional)

gregsheremeta avatar Jun 18 '24 15:06 gregsheremeta

actually, you may have been referring to the code in empty_dir.py :) In which case, kind of the same answer. It has the same pattern as (say) secrets but that I have these 4 fields instead.

gregsheremeta avatar Jun 18 '24 15:06 gregsheremeta

The failing tests are expected. When #10892 is merged, I'll rebase this, and then the tests will work

@chensun @james-jwu @zijianjoy @connor-mccarthy can I ask for your review, or can you recommend someone to review? thanks.

gregsheremeta avatar Jun 25 '24 21:06 gregsheremeta

/rerun-all

rimolive avatar Jul 02 '24 20:07 rimolive

/lgtm

hbelmiro avatar Jul 04 '24 18:07 hbelmiro

@chensun @connor-mccarthy I am not sure why these tests are failing, but I would appreciate your review of this feature, as not being able to mount emptyDir to tasks is preventing a number of people updating to KFP v2.

If you are wondering why this is important, it's because PyTorch expects there to be a tempfs mounted at /dev/shm, and the only way to ensure this in a Kubernetes Pod is to mount an emptyDir with medium="Memory" to it.

thesuperzapper avatar Jul 11 '24 16:07 thesuperzapper

@gregsheremeta FYI all the GitHub Actions tests are mandatory. Some of the Prow ones may be optional. You can check which of them are optional here. And some of them should have been already removed. There are several PRs to remove the ones that we migrated to GHA. They are waiting for review.

hbelmiro avatar Jul 11 '24 17:07 hbelmiro

@gregsheremeta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
kubeflow-pipeline-e2e-test a8374c2758f9e772cf4b7fefab52a5cdf35351d3 link false /test kubeflow-pipeline-e2e-test
kfp-kubernetes-execution-tests a8374c2758f9e772cf4b7fefab52a5cdf35351d3 link false /test kfp-kubernetes-execution-tests
kubeflow-pipeline-upgrade-test 621469b3937d12b681de7822e5f2571cb841b0de link false /test kubeflow-pipeline-upgrade-test
kubeflow-pipelines-samples-v2 621469b3937d12b681de7822e5f2571cb841b0de link false /test kubeflow-pipelines-samples-v2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

google-oss-prow[bot] avatar Jul 25 '24 21:07 google-oss-prow[bot]

Added some nitpick error handling and logging suggestions for the driver code. The Kubernetes Platform / SDK code and the corresponding tests LGTM.

DharmitD avatar Aug 16 '24 01:08 DharmitD

/lgtm

rimolive avatar Aug 16 '24 13:08 rimolive

/lgtm /approve

Thanks!

HumairAK avatar Aug 16 '24 14:08 HumairAK

@chensun this is ready for your final review / merge. Can you take a look?

gregsheremeta avatar Aug 16 '24 15:08 gregsheremeta

@chensun bumping :)

gregsheremeta avatar Aug 22 '24 20:08 gregsheremeta

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chensun, HumairAK

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Aug 23 '24 19:08 google-oss-prow[bot]

Congrats on the merge, @gregsheremeta. Thank you for tackling this meaningful contribution! 🔥

droctothorpe avatar Aug 23 '24 20:08 droctothorpe

this should now be available in KFP 2.3.0 / SDK 2.9.0

gregsheremeta avatar Sep 10 '24 20:09 gregsheremeta