[feature/question] Lightweight Python component module capture support in KFP SDK v2
Feature Area
/area sdk
Question
In KFP SDK v1, using `func_to_container_op` one could use the `modules_to_capture` parameter to add additional Python modules that should be captured. This allowed one to reuse helper/utility functions across multiple components, or to design complex components in a modular way.

According to the documentation for lightweight Python components in both v1 and v2, the Python function should be self-contained, but v1 had a way around this via the `modules_to_capture` functionality.
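For context, a minimal sketch of that v1 usage (the helper module name `library` and the function `helper_function` are illustrative):

```python
import kfp
import library  # local helper module, captured below


def my_component():
    library.helper_function()


# Pickle the function together with the `library` module so the helper
# is available inside the component's container at runtime.
my_component_op = kfp.components.func_to_container_op(
    my_component,
    use_code_pickling=True,
    modules_to_capture=["library"],
)
```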
For v2, the only workarounds I see are:
- Publish the Python code (not contained in the function) to a package repository and use the `pip_index_urls` parameter currently supported by the component decorator in v2
- Build and push a container image containing both the helper utility and the component function to an image registry (as demonstrated by the v2 docs for Containerized Python Components)
For both of these options, one would need to push an image/package before being able to use the component. This causes friction during the component development process.

Could/will this functionality be added to KFP SDK v2? Is there another way to achieve this that I am missing? What is best practice for using helper functions across multiple components?
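For concreteness, a minimal sketch of the first workaround, assuming the helper code has already been published as a hypothetical package `my-helpers` to a private index:

```python
from kfp import dsl


@dsl.component(
    base_image="python:3.9",
    # Hypothetical package name and index URL; the helper code must already
    # be published there before the component can run.
    packages_to_install=["my-helpers"],
    pip_index_urls=["https://pypi.example.com/simple"],
)
def my_component():
    from my_helpers import helper_function  # provided by the my-helpers package
    helper_function()
```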
Love this idea? Give it a 👍.
Hi, @JacusSH. Thanks for the question. For KFP SDK v2, the "workarounds" you describe are indeed the supported paths to the outcome you describe. There are currently no plans to support `modules_to_capture`, though we're exploring some other improvements to the containerization experience that may make it into a future KFP SDK release.
Hi @connor-mccarthy. The current suggested solutions result in quite a lot of friction in the development process. Consider the situation where multiple components use a common module. If you create a new component that also depends on this common module but requires some extra functionality, you are forced to update the common module first and upload it to a repository or container image before you can experiment with using that module in a component.

Previously, users of Kubeflow could make a small change to the common module and test it out in their component before committing to building images or uploading to a remote repository. This was enabled via the `modules_to_capture` feature and was extremely useful.

It's very rare for components not to share modules with other components, so I suspect this is a very common use case. What was the reasoning behind removing support for the `modules_to_capture` feature?
> What was the reasoning behind removing support for the `modules_to_capture` feature?
The short answer is that we weren't aware of active usage of this feature.

As part of the `ContainerOp` deprecation announced in 2020 (https://github.com/kubeflow/pipelines/issues/4713), we've been recommending using `create_component_from_func` instead of `func_to_container_op`, to avoid feature "duplication" between the two and the leaking of the word `container_op` into the public interface.
> Previously, users of Kubeflow could make a small change to the common module and test it out in their component before committing to building images or uploading to a remote repository. This was enabled via the `modules_to_capture` feature and was extremely useful.
@WalterSmuts, can you share a bit more detail on how you test it out in a component before making it available from a remote image or module? There's a chance we might be able to close the feature gap via our upcoming local testing support.
Consider the following setup:
```console
walter@rocky:~/development/kubeflow-issue-9765$ tree
.
├── library.py
└── my_pipeline.py

1 directory, 2 files
walter@rocky:~/development/kubeflow-issue-9765$ cat library.py
MY_KUBEFLOW_HOST = "HOST_DOMAIN_NAME"


def super_useful_common_function():
    print("I am super useful")
walter@rocky:~/development/kubeflow-issue-9765$ cat my_pipeline.py
import kfp
import library


def my_component_op():
    def my_component():
        library.super_useful_common_function()

    return kfp.components.func_to_container_op(
        my_component, use_code_pickling=True, modules_to_capture=["library"]
    )()


def my_pipeline_func():
    my_component_op()


if __name__ == "__main__":
    client = kfp.Client(host=library.MY_KUBEFLOW_HOST)
    run_result = client.create_run_from_pipeline_func(
        pipeline_func=my_pipeline_func,
        arguments={},
    )
```
Now I want to add a new pipeline using the basic template of `my_pipeline.py`:
```console
walter@rocky:~/development/kubeflow-issue-9765$ cp my_pipeline.py my_other_pipeline.py
```
I want to add a new feature to `super_useful_common_function`, but I don't want to change existing behavior (to keep backwards compatibility with other pipelines), so I may make the following change:
```diff
walter@rocky:~/development/kubeflow-issue-9765$ git diff
diff --git a/library.py b/library.py
index 5b8be79..7e86686 100644
--- a/library.py
+++ b/library.py
@@ -1,5 +1,7 @@
 MY_KUBEFLOW_HOST = "HOST_DOMAIN_NAME"
 
 
-def super_useful_common_function():
+def super_useful_common_function(enable_new_feature: bool = False):
     print("I am super useful")
+    if enable_new_feature:
+        print("I am a super useful new feature")
```
The `modules_to_capture` feature means my local workspace is all I need to change in order to actually test out this new pipeline, before pushing any changes and going through a code-review process:
```diff
walter@rocky:~/development/kubeflow-issue-9765$ git diff my_other_pipeline.py
diff --git a/my_other_pipeline.py b/my_other_pipeline.py
index 7981846..f21b467 100644
--- a/my_other_pipeline.py
+++ b/my_other_pipeline.py
@@ -4,7 +4,7 @@ import library
 
 def my_component_op():
     def my_component():
-        library.super_useful_common_function()
+        library.super_useful_common_function(enable_new_feature=True)
 
     return kfp.components.func_to_container_op(
         my_component, use_code_pickling=True, modules_to_capture=["library"]
```
We are spiking out KFP v2, and this is one feature our use cases rely on extensively as well. Our users like to structure their code well and to test it.

Without `modules_to_capture`, all the dependent functions need to be inside the component code, which is ugly and violates the DRY principle: any small helper function would either need to be copied into each individual component function, or the approaches mentioned in the original feature request would need to be followed, which is overkill for such small helper functions.

We probably wouldn't be able to move to KFP v2/SDK v2 unless our use cases can somehow efficiently include dependent functions and modules at runtime.
Thank you for the feedback, @revolutionisme. We will keep this open and continue to explore the source code packaging options available, incorporating your feedback.
I do want to highlight for future viewers that this is not strictly true:

> Without `modules_to_capture`, all the dependent functions need to be inside the component code

As you mention, Containerized Python Components provide an alternative approach, though it is not as lightweight as the `modules_to_capture` approach and comes with its own set of challenges.
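For future viewers, a minimal sketch of that alternative (the source directory `src/`, the helper module `helpers.py`, and the image tag are all illustrative):

```python
# src/my_component.py — Containerized Python Component sketch.
# Build and push the image with, e.g.:
#   kfp component build src/ --component-filepattern my_component.py --push-image
from kfp import dsl

from helpers import super_useful_common_function  # src/helpers.py, packaged into the image


@dsl.component(
    base_image="python:3.9",
    # Illustrative registry/tag; the built image contains everything under
    # src/, so helper modules are importable inside the component at runtime.
    target_image="registry.example.com/my-project/my-component:v1",
)
def my_component():
    super_useful_common_function()
```

This keeps helpers importable without `modules_to_capture`, at the cost of an image build and push for every change.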
After the Kubeflow Pipelines community call, I would like to add details of how our users are using `modules_to_capture` and why moving to SDK v2 is a bit challenging for them.

We have many different kinds of users, from data scientists, data engineers, and developers to research students, and they all organise their code differently. With the current state of SDK v2, we have the following challenges:
- Users now explicitly need a Docker image builder to be installed for building and pushing images to the container registry. This is not always feasible because of restrictions, licensing, or different environments (Windows, etc.).
- Users also need to organise all the code in specific folders so that it can be included in the image-building process, which introduces additional challenges. Imagine the user needs more than one component, each using some shared util/helper module, as shown below:

  ```
  ├── components
  │   ├── step1
  │   │   ├── module1.py
  │   ├── step2
  │   │   ├── module2.py
  │   ├── utils
  │   │   ├── util.py
  └── pipeline.py
  ```

  In this case, if we want to create different images for different components, we can't, because as specified here we would have to copy the `utils` module into each folder individually (unless we write custom logic to copy it at compile time; see the sketch after this list). On the other hand, if we build components with the source directory set to the `components` folder, every component image would include the code for the other component steps as well.
- Adding to the point before, users can't use configuration or other files from elsewhere in the codebase.
- This implicitly increases registry costs, since a new image is created for each version of each component, and this grows quickly when we consider multiple pipelines being used by different users of the platform per namespace.
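As an illustration of the compile-time copy workaround mentioned above, a minimal sketch under the layout shown in the tree (the script name and paths are hypothetical):

```python
# build_components.py — hypothetical helper that copies the shared utils
# module into each component folder before running `kfp component build`,
# so that each component image is self-contained.
import shutil
import subprocess
from pathlib import Path

COMPONENTS_DIR = Path("components")

for step_dir in COMPONENTS_DIR.glob("step*"):
    # Copy the shared utils package next to the component source.
    shutil.copytree(COMPONENTS_DIR / "utils", step_dir / "utils", dirs_exist_ok=True)
    # Build and push the image for this component folder.
    subprocess.run(
        ["kfp", "component", "build", str(step_dir), "--push-image"],
        check=True,
    )
```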
Thanks for raising this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We have similar issues to those raised by @revolutionisme. We are looking to create a structure like so:

```
├── pipelines/
│   ├── pipeline_1/
│   │   ├── components.py
│   │   └── pipeline.py
│   ├── pipeline_2/
│   │   ├── components.py
│   │   └── pipeline.py
│   └── utils/
│       └── helper_functions.py
└── constants.py
```

where the `components.py` files make use of helper functions defined in `utils/helper_functions.py`.
Something we are struggling with is importing base images we have defined in `constants.py` into our component decorators when running the build command from the root directory. This is the command we use for building pipeline 1:

```shell
kfp component build . --component-filepattern="pipelines/pipeline_1/components.py" --push-image
```
During the build, we get the error "No module named 'constants'" from the following import specified in `pipelines/pipeline_1/components.py`:

```python
from constants import BASE_IMAGE
```
From our understanding, the reason this doesn't work is that this function in the build script (kubeflow GitHub) loads the components within the scope of their immediate parent directory instead of the directory provided to the build command (i.e. the root directory in our case).

Could someone help us understand why it works like this? It seems counterintuitive to us, given that we pass the root directory to the build command in the first place. (Furthermore, these imports are fine at runtime, since the target image gets a full copy of the root directory; they just don't work when trying to build the target images.)
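One possible workaround, purely as a hedged sketch and not a confirmed fix: since the build script appears to put only the component file's parent directory on the module search path, the project root could be added manually before the import (the relative paths assume the layout above):

```python
# pipelines/pipeline_1/components.py — hypothetical workaround sketch.
import os
import sys

# The build script loads this file with only its immediate parent directory
# in scope, so prepend the project root (two levels up) to sys.path first.
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))

from constants import BASE_IMAGE  # noqa: E402  (import after the sys.path tweak)
```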
@Levatius, does including an `__init__.py` file in the directory resolve this issue?
It does not, unfortunately. I have tried various configurations of adding `__init__.py` files to the directories, but I still see:

```
No module named 'constants'
```
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@yaoman3: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.