[feature] Expose ExecutionConfig.ttl property to DataprocPySparkBatchOp component
Feature Area
/area components
What feature would you like to see?
Add a new ttl parameter to dataproc_create_pyspark_batch so that pipelines can set the batch's ExecutionConfig.ttl timeout when using the DataprocPySparkBatchOp component.
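If the parameter were added, a pipeline could set the batch lifetime directly on the component call. A minimal sketch, assuming the proposed ttl parameter alongside existing DataprocPySparkBatchOp arguments; the batch id and bucket path are placeholders:

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp


@dsl.pipeline(name='pyspark-batch-with-ttl')
def pipeline(project: str, location: str):
    DataprocPySparkBatchOp(
        project=project,
        location=location,
        batch_id='long-running-batch',                      # placeholder
        main_python_file_uri='gs://my-bucket/jobs/etl.py',  # placeholder
        ttl='86400s',  # proposed parameter: allow the batch to run up to 24 hours
    )
```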
What is the use case or pain point?
In Dataproc runtimes 1.1 and 2.0 this value was unbounded, allowing long-running batch jobs to execute. In Dataproc 2.1 it defaults to 4 hours. If a pipeline that was previously using the DataprocPySparkBatchOp component upgrades to Dataproc 2.1 and the batch job takes longer than 4 hours to run, it will not complete; instead the batch is cancelled. When this cancellation happens, the KFP pipeline reports that the component stage completed successfully, which is not the case.
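For reference, the value in question is the ttl field of the batch's ExecutionConfig in the Dataproc Batches API, expressed as a duration string. The sketch below shows roughly where it sits in a batch request body; all field values are illustrative only:

```python
# Illustrative shape of a Dataproc Batches API request body. The TTL that
# Dataproc 2.1 caps at 4 hours by default lives under
# environmentConfig.executionConfig.ttl and is a duration string.
batch_body = {
    "pysparkBatch": {"mainPythonFileUri": "gs://my-bucket/jobs/etl.py"},
    "environmentConfig": {
        "executionConfig": {
            "serviceAccount": "batch-runner@my-project.iam.gserviceaccount.com",
            "ttl": "86400s",  # 24 hours
        }
    },
}
```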
Is there a workaround currently?
The current workaround is to execute the batch on the Dataproc 2.0 runtime.
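A minimal sketch of that workaround, assuming the component's runtime_config_version parameter is what pins the batch runtime version:

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp


@dsl.pipeline(name='pyspark-batch-pinned-runtime')
def pipeline(project: str, location: str):
    DataprocPySparkBatchOp(
        project=project,
        location=location,
        batch_id='long-running-batch',                      # placeholder
        main_python_file_uri='gs://my-bucket/jobs/etl.py',  # placeholder
        runtime_config_version='2.0',  # stay on 2.0 until ttl is exposed
    )
```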
Love this idea? Give it a 👍.
/assign @connor-mccarthy
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
This issue is still affecting us, and preventing usage of runtime versions > 2.0.
@rfbigelow: Reopened this issue.
So this essentially makes that version of the component unusable for any real world jobs?
Not quite that bad, but it does rule out launching long-running batch jobs (> 4 hours) on Dataproc 2.1+ from KFP.
@connor-mccarthy It looks like this can be implemented in the container component file. No changes to the container image are needed.
We ran into this issue too and we worked around it.
Based on this comment, we decided to take the code from the file and create our own custom version to test with.
@connor-mccarthy It looks like this can be implemented in the container component file. No changes to the container image are needed.
- Copied the code to the file where we needed it
- Renamed the dataproc_create_pyspark_batch method:
def dataproc_create_pyspark_batch_ttl(
- Added a parameter for ttl:
...
network_uri: str = '',
subnetwork_uri: str = '',
ttl: str = '',
metastore_service: str = '',
...
- Updated the ConcatPlaceholder with a ttl property under execution_config and populated it with the value of the parameter:
...
'"execution_config": {',
'"service_account": "',
service_account,
'"',
'"ttl": "',
ttl,
'"',
', "network_tags": ',
network_tags,
', "kms_key": "',
kms_key,
'"',
...
- Ran a few tests and the ttl configuration was picked up and used. We call the method directly instead of through the name set in the __init__. A consolidated sketch of the patched component is shown below.
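For anyone applying the same local patch, here is a trimmed, illustrative sketch of what the modified container component can look like. It is not the full dataproc_create_pyspark_batch source: most parameters and most of the payload are omitted, and the launcher image and entrypoint are placeholders; only the ttl plumbing mirrors the steps above.

```python
from kfp import dsl


@dsl.container_component
def dataproc_create_pyspark_batch_ttl(
    project: str,
    location: str,
    main_python_file_uri: str,
    gcp_resources: dsl.OutputPath(str),
    batch_id: str = '',
    service_account: str = '',
    ttl: str = '',  # new parameter, e.g. '86400s' for a 24 hour TTL
):
    return dsl.ContainerSpec(
        image='gcr.io/example/launcher:latest',      # placeholder image
        command=['python3', '-m', 'launcher.main'],  # placeholder entrypoint
        args=[
            '--project', project,
            '--location', location,
            '--batch_id', batch_id,
            '--gcp_resources', gcp_resources,
            '--payload',
            dsl.ConcatPlaceholder([
                '{',
                '"pyspark_batch": {',
                '"main_python_file_uri": "', main_python_file_uri, '"',
                '}',
                ', "environment_config": {',
                '"execution_config": {',
                '"service_account": "', service_account, '"',
                ', "ttl": "', ttl, '"',  # new entry, comma-prefixed like its neighbours
                '}',
                '}',
                '}',
            ]),
        ],
    )
```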
Tried checking how to contribute this back, but it's too cumbersome to spend the time on it. If someone else feels compelled to jump through the hoops, feel free. Otherwise I hope this helps in at least fixing it locally for your project(s).