
[feature] Expose ExecutionConfig.ttl property to DataprocPySparkBatchOp component

Open rfbigelow opened this issue 2 years ago • 2 comments

Feature Area

/area components

What feature would you like to see?

Add a new ttl parameter to dataproc_create_pyspark_batch so that pipelines can specify this timeout value when using the DataprocPySparkBatchOp component.

What is the use case or pain point?

In Dataproc 1.1 and 2.0 this value was unbounded, allowing long-running batch jobs to execute. In Dataproc 2.1 it defaults to 4 hours. If a pipeline that previously used the DataprocPySparkBatchOp component upgrades to Dataproc 2.1 and its batch job takes longer than 4 hours, the job will not complete; instead, the batch is cancelled. When this cancellation happens, the KFP pipeline reports that the component stage completed successfully, which is not the case.
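For reference, ttl lives under ExecutionConfig in the Dataproc Batches API. A rough sketch of a batch request body with it set (placeholder values; the snake_case keys mirror the shape the component builds):

# Rough shape of a Dataproc batch request body with ExecutionConfig.ttl set.
# Placeholder values; ttl is a Duration string, here 8 hours instead of the
# 4-hour default that Dataproc 2.1 applies when ttl is omitted.
batch = {
    'pyspark_batch': {'main_python_file_uri': 'gs://my-bucket/jobs/etl.py'},
    'runtime_config': {'version': '2.1'},
    'environment_config': {
        'execution_config': {
            'service_account': 'batch-sa@my-project.iam.gserviceaccount.com',
            'ttl': '28800s',
        }
    },
}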

Is there a workaround currently?

The workaround is to use Dataproc 2.0 to execute the batch.
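For example, a minimal sketch of that workaround, assuming the component's runtime_config_version parameter (which maps to the batch's RuntimeConfig.version) is used to pin the runtime; the project, batch_id, and GCS paths are placeholders:

# Sketch only: pin the serverless runtime to 2.0, where the batch ttl is
# effectively unbounded. All resource names and paths are placeholders.
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp
from kfp import dsl


@dsl.pipeline(name='pyspark-batch-on-runtime-2-0')
def pyspark_batch_on_runtime_2_0():
    DataprocPySparkBatchOp(
        project='my-project',
        location='us-central1',
        batch_id='my-long-batch',
        main_python_file_uri='gs://my-bucket/jobs/etl.py',
        runtime_config_version='2.0',
    )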


Love this idea? Give it a 👍.

rfbigelow avatar Oct 31 '23 21:10 rfbigelow

/assign @connor-mccarthy

zijianjoy avatar Nov 02 '23 22:11 zijianjoy

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Feb 01 '24 07:02 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Mar 14 '24 07:03 github-actions[bot]

/reopen

This issue is still affecting us, and preventing usage of runtime versions > 2.0.

rfbigelow avatar Mar 21 '24 00:03 rfbigelow

@rfbigelow: Reopened this issue.

In response to this:

/reopen

This issue is still affecting us, and preventing usage of runtime versions > 2.0.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Mar 21 '24 00:03 google-oss-prow[bot]

So this essentially makes that version of the component unusable for any real world jobs?

jeffkpayne avatar Mar 21 '24 16:03 jeffkpayne

So this essentially makes that version of the component unusable for any real world jobs?

Not quite that bad, but it does rule out launching long-running batch jobs (> 4 hours) on Dataproc 2.1+ from KFP.

rfbigelow avatar Mar 21 '24 22:03 rfbigelow

@connor-mccarthy It looks like this can be implemented in the container component file. No changes to the container image are needed.

rfbigelow avatar Mar 21 '24 22:03 rfbigelow

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 21 '24 07:05 github-actions[bot]

We ran into this issue too and we worked around it.

Based on this comment, we decided to take the code from the file and create our own custom version to test with.

@connor-mccarthy It looks like this can be implemented in the container component file. No changes to the container image are needed.

  • Added a parameter for ttl:

    def dataproc_create_pyspark_batch_ttl(
        ...
        network_uri: str = '',
        subnetwork_uri: str = '',
        ttl: str = '',
        metastore_service: str = '',
        ...

  • Updated the ConcatPlaceholder with a ttl property under execution_config, populated with the value of the parameter:

              ...
              '"execution_config": {',
              '"service_account": "',
              service_account,
              '"',
              ', "ttl": "',
              ttl,
              '"',
              ', "network_tags": ',
              network_tags,
              ', "kms_key": "',
              kms_key,
              '"',
              ...

  • Ran a few tests; the ttl configuration was picked up and used. We call the method directly instead of through the name set in the __init__. (A usage sketch follows below.)
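For completeness, a hypothetical call site for the customized component, calling the function directly as noted above; the project, batch_id, GCS path, and the 8-hour ttl value are placeholders:

# Hypothetical usage of the customized container component defined above.
# Parameter names mirror the upstream dataproc_create_pyspark_batch component,
# with the new ttl added; '28800s' is a Duration string (8 hours).
from kfp import dsl


@dsl.pipeline(name='pyspark-batch-with-ttl')
def pyspark_batch_with_ttl():
    dataproc_create_pyspark_batch_ttl(
        project='my-project',
        location='us-central1',
        batch_id='my-long-batch',
        main_python_file_uri='gs://my-bucket/jobs/etl.py',
        ttl='28800s',
    )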

I looked into contributing this back, but the process is too cumbersome to spend the time on. If someone else feels compelled to jump through the hoops, feel free. Otherwise, I hope this helps you at least fix it locally for your project(s).

marnux avatar Jun 03 '24 18:06 marnux

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 03 '24 07:08 github-actions[bot]