[feature] Avoiding busy waiting for long-running external tasks
Feature Area
/area backend /area sdk
What feature would you like to see?
Kubeflow Pipelines does not provide a way to poll intermittently for the completion of long-running tasks on an external system. Request: create an op that polls or re-executes at intervals to determine whether an external task has finished.
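To make the request concrete, here is a rough sketch of the polling behavior we have in mind. It is not an existing KFP API: `poll_until_done`, `check_status`, and the state names are all hypothetical, and `check_status` stands in for whatever call queries the external system. Note that a loop like this still occupies a pod while it sleeps, so it only illustrates the interval and timeout semantics, not the resource savings.

```python
import time
from typing import Callable

# Assumed terminal state names, for illustration only.
TERMINAL_STATES = frozenset({"Succeeded", "Failed"})


def poll_until_done(
    check_status: Callable[[], str],
    interval_s: float = 300.0,
    timeout_s: float = 7 * 24 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call check_status() every interval_s seconds until a terminal
    state is reported, or raise TimeoutError after timeout_s seconds."""
    deadline = clock() + timeout_s
    while True:
        status = check_status()
        if status in TERMINAL_STATES:
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(
                f"external task still in state {status!r} after {timeout_s} seconds"
            )
        # Sleep instead of spinning, and never sleep past the deadline.
        sleep(min(interval_s, remaining))
```

The `sleep` and `clock` parameters are injectable only so the loop can be exercised without real waiting; an explicit `timeout_s` is the knob that end users currently lack.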
What is the use case or pain point?
The use case is model training. KFP provides an HPO execution option that launches a job via Katib; however, it busy-waits for the job to complete.
This is problematic for a few reasons:
- It wastes compute resources and money, since such jobs can take days or weeks to complete depending on the type of model and task.
- KFP hosted on transient/preemptible environments can lose track of the experiment if the pod is evicted or killed.
- It is hard for end users to determine the maximum busy-wait time via the current launcher.
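One way to address the eviction problem in particular would be a stateless, single-shot check: the step receives the external job's identifier, queries its status once, and exits, so nothing is lost if the pod is killed between checks. The sketch below is only an illustration under that assumption. `check_once`, `CheckResult`, `get_status`, and the state names are hypothetical, and how the backend would reschedule the next check is exactly the part we are asking for.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed terminal state names, for illustration only.
TERMINAL_STATES = frozenset({"Succeeded", "Failed"})


@dataclass(frozen=True)
class CheckResult:
    job_id: str
    status: str
    done: bool
    # Suggested delay before the next check; None once the job is done.
    retry_after_s: Optional[float]


def check_once(
    job_id: str,
    get_status: Callable[[str], str],
    retry_after_s: float = 600.0,
) -> CheckResult:
    """Query the external system a single time and return immediately.

    All state lives in the external system, keyed by job_id, so this
    function can be re-run safely after a pod eviction.
    """
    status = get_status(job_id)
    done = status in TERMINAL_STATES
    return CheckResult(job_id, status, done, None if done else retry_after_s)
```

Because each invocation is short-lived, compute is only consumed for the duration of the status query rather than for the days or weeks the external job may run.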
Is there a workaround currently?
Currently, we do not use pipelines to launch Katib jobs; instead, we have our own wrappers to manage this execution. Another option is to use the provided launcher, but we have noticed that we lose track of the experiment if the node hosting the launcher gets preempted.
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.