[feature] Expose ExecutionConfig.ttl property to DataprocPySparkBatchOp component
Feature Area
/area components
What feature would you like to see?
Add a new ttl parameter to dataproc_create_pyspark_batch so that pipelines can set the batch's ExecutionConfig.ttl timeout when using the DataprocPySparkBatchOp component.
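If the parameter were added, a pipeline could set the batch lifetime directly on the component call. A minimal sketch, assuming the proposed ttl parameter alongside existing DataprocPySparkBatchOp arguments; the batch id and bucket path are placeholders:

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp


@dsl.pipeline(name='pyspark-batch-with-ttl')
def pipeline(project: str, location: str):
    DataprocPySparkBatchOp(
        project=project,
        location=location,
        batch_id='long-running-batch',                      # placeholder
        main_python_file_uri='gs://my-bucket/jobs/etl.py',  # placeholder
        ttl='86400s',  # proposed parameter: allow the batch to run up to 24 hours
    )
```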
What is the use case or pain point?
In Dataproc runtimes 1.1 and 2.0 this value was unbounded, allowing long-running batch jobs to execute. In Dataproc 2.1 it defaults to 4 hours. If a pipeline that was previously using the DataprocPySparkBatchOp component upgrades to Dataproc 2.1 and the batch job takes longer than 4 hours to run, it will not complete; instead the batch is cancelled. When this cancellation happens, the KFP pipeline reports that the component stage completed successfully, which is not the case.
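For reference, the value in question is the ttl field of the batch's ExecutionConfig in the Dataproc Batches API, expressed as a duration string. The sketch below shows roughly where it sits in a batch request body; all field values are illustrative only:

```python
# Illustrative shape of a Dataproc Batches API request body. The TTL that
# Dataproc 2.1 caps at 4 hours by default lives under
# environmentConfig.executionConfig.ttl and is a duration string.
batch_body = {
    "pysparkBatch": {"mainPythonFileUri": "gs://my-bucket/jobs/etl.py"},
    "environmentConfig": {
        "executionConfig": {
            "serviceAccount": "batch-runner@my-project.iam.gserviceaccount.com",
            "ttl": "86400s",  # 24 hours
        }
    },
}
```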
Is there a workaround currently?
The current workaround is to execute the batch on the Dataproc 2.0 runtime.
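A minimal sketch of that workaround, assuming the component's runtime_config_version parameter is what pins the batch runtime version:

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp


@dsl.pipeline(name='pyspark-batch-pinned-runtime')
def pipeline(project: str, location: str):
    DataprocPySparkBatchOp(
        project=project,
        location=location,
        batch_id='long-running-batch',                      # placeholder
        main_python_file_uri='gs://my-bucket/jobs/etl.py',  # placeholder
        runtime_config_version='2.0',  # stay on 2.0 until ttl is exposed
    )
```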
Love this idea? Give it a 👍.
/assign @connor-mccarthy
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
This issue is still affecting us, and preventing usage of runtime versions > 2.0.
@rfbigelow: Reopened this issue.
So this essentially makes that version of the component unusable for any real world jobs?
Not quite that bad, but it does rule out launching long-running batch jobs (> 4 hours) on Dataproc 2.1+ from KFP.
@connor-mccarthy It looks like this can be implemented in the container component file. No changes to the container image are needed.
We ran into this issue too and we worked around it.
Based on this comment, we decided to take the code from the file and create our own custom version to test with.
@connor-mccarthy It looks like this can be implemented in the container component file. No changes to the container image are needed.
- Copied the code to the file where we needed it
- Renamed the dataproc_create_pyspark_batch method:
def dataproc_create_pyspark_batch_ttl(
- Added a parameter for ttl:
...
network_uri: str = '',
subnetwork_uri: str = '',
ttl: str = '',
metastore_service: str = '',
...
- Updated the ConcatPlaceholder with a ttl property under execution_config and populated it with the value of the parameter:
...
'"execution_config": {',
'"service_account": "',
service_account,
'"',
'"ttl": "',
ttl,
'"',
', "network_tags": ',
network_tags,
', "kms_key": "',
kms_key,
'"',
...
- Ran a few tests and the ttl configuration was picked up and used. We call the method directly instead of through the name set in the __init__. A consolidated sketch of the patched component is shown below.
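For anyone applying the same local patch, here is a trimmed, illustrative sketch of what the modified container component can look like. It is not the full dataproc_create_pyspark_batch source: most parameters and most of the payload are omitted, and the launcher image and entrypoint are placeholders; only the ttl plumbing mirrors the steps above.

```python
from kfp import dsl


@dsl.container_component
def dataproc_create_pyspark_batch_ttl(
    project: str,
    location: str,
    main_python_file_uri: str,
    gcp_resources: dsl.OutputPath(str),
    batch_id: str = '',
    service_account: str = '',
    ttl: str = '',  # new parameter, e.g. '86400s' for a 24 hour TTL
):
    return dsl.ContainerSpec(
        image='gcr.io/example/launcher:latest',      # placeholder image
        command=['python3', '-m', 'launcher.main'],  # placeholder entrypoint
        args=[
            '--project', project,
            '--location', location,
            '--batch_id', batch_id,
            '--gcp_resources', gcp_resources,
            '--payload',
            dsl.ConcatPlaceholder([
                '{',
                '"pyspark_batch": {',
                '"main_python_file_uri": "', main_python_file_uri, '"',
                '}',
                ', "environment_config": {',
                '"execution_config": {',
                '"service_account": "', service_account, '"',
                ', "ttl": "', ttl, '"',  # new entry, comma-prefixed like its neighbours
                '}',
                '}',
                '}',
            ]),
        ],
    )
```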
Tried checking how to contribute this back, but it's too cumbersome to spend the time on it. If someone else feels compelled to jump through the hoops, feel free. Otherwise I hope this helps in at least fixing it locally for your project(s).