pipelines feat(components): add labels to gcp ml engine train/batch

Description of your changes: Add labels to organize jobs on ml engine

Checklist:

[x] The title for your pull request (PR) should follow our title convention. Learn more about the pull request title convention used in this repository.

[ ] Do you want this pull request (PR) cherry-picked into the current release branch?

Learn more about cherry-picking updates into the release branch.

Mar 05 '21 10:03 dkajtoch

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dkajtoch To complete the pull request process, please assign ark-kun after the PR has been reviewed. You can assign the PR to them by writing /assign @ark-kun in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

components/OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Mar 05 '21 10:03 google-oss-robot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

Mar 05 '21 10:03 google-cla[bot]

Hi @dkajtoch. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Mar 05 '21 10:03 google-oss-robot

/assign @ark-kun

Mar 05 '21 10:03 dkajtoch

@googlebot I signed it!

Mar 05 '21 11:03 dkajtoch

How did you verify that the updated component code works?

Mar 09 '21 00:03 Ark-kun

/ok-to-test

Mar 09 '21 00:03 Ark-kun

@Ark-kun I changed return type to JsonObject and updated ml engine train test

I build docker image locally and run some poc pipeline with single task to test if labels appear and everything works (I have access to google cloud).

import kfp
from kfp import dsl

task_factory = kfp.components.load_component_from_file("component.yaml")


@dsl.pipeline(
    name='ml-train',
    description='ML train pipeline.'
)
def poc_pipeline():
    task = task_factory(
        project_id='project-id',
        region='europe-west4',
        args=[
            '--preprocess',
            '--model_type', 'classification',
            '--batch_size', '250',
            '--learning_rate', '0.1',
            '--max_steps', '1',
            '--training_data_path',
            'gs://cloud-samples-data/ai-platform/census/algorithms/data/train.csv'
        ],
        job_dir='gs://my-test-bucket',
        job_id_prefix='job-',
        job_id='job-1',
        labels={
            'env': 'dev',
            'user': 'test',
        },
        master_image_uri='gcr.io/cloud-ml-algos/linear_learner_cpu:latest',
    )

Mar 09 '21 14:03 dkajtoch

Interesting. I do not fully understand why this code works. The command-line receives the labels as a JSON-serialized string. The train function has an untyped parameter labels which is treated as a dictionary. Where is the JSON stricg converted to dictionary?

Mar 17 '21 08:03 Ark-kun

@Ark-kun I need to check it once again since I am passing Dict in tests and in pipeline so maybe there is no conversion. BTW in ml engine train yaml we have List as args so do not know why json.dumps is ok here https://github.com/kubeflow/pipelines/blob/master/samples/core/ai_platform/ai_platform.ipynb (I pass list directly and it works)

Mar 17 '21 16:03 dkajtoch

BTW in ml engine train yaml we have List as args so do not know why json.dumps is ok here

When you pass a string, the SDK assumes that you've serialized the argument yourself. Otherwise it will try to serialize the argument based on the input type. This only works for a small number of types: int, float, bool, list (==JsonArray) and dict (==JsonObject).

In the end the command line is always an array of strings when being executed.

I'd expect that the train function receives the labels as a JSON string. But then it's strange that train backend does not complain about the string instead of map.

Mar 19 '21 05:03 Ark-kun

@Ark-kun sorry for such a delay. I did some tests with labels and it seems to work both with Dict and Json string. I added logger to _train.py in order to see what type I have in train(for labels) function. I pass labels to poc_pipeline in two forms:

labels = {"name": "a", "env": "dev"}  # dict
labels = json.dumps({"name": "a", "env": "dev"})  # string

In both cases, type is Dict.

I am running kubeflow pipelines via gcp ai platform version 1.0.4

Mar 30 '21 11:03 dkajtoch

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jun 28 '21 12:06 stale[bot]

Thank you for verifying.

I did some investigation and it looks like the Fire library is doing this. It tries to parse every argument by adding missing quotes then parsing it as python code. https://github.com/google/python-fire/blob/c1266d0dbb2114514fcf8be62044344b5a51c733/fire/parser.py#L78 Since most valid JSON is also valid python (not always), this usually works.

I do not really like this - we're basically parsing JSON as if it was Python.

Could you please add explicit parsing to train and batch_predict?

import json

...

@decorators.SetParseFns(python_version=str, runtime_version=str, labels=json.loads)
def train(project_id,


import json
from fire import decorators
...
@decorators.SetParseFns(labels=json.loads)
def batch_predict

Jul 11 '21 01:07 Ark-kun

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Mar 02 '22 12:03 stale[bot]

feat(components): add labels to gcp ml engine train/batch_predict jobs. Fixes #5038

What to do if you already signed the CLA

Individual signers

Corporate signers