zenml
Enable orchestration environment restarts for dynamic pipelines
Describe changes
This PR adds support for restarting the orchestrator environment when running dynamic pipelines. To support this, an orchestrator must implement restarts of the orchestration container and must make sure that the `get_orchestrator_run_id()` method returns the same value after a restart. This is currently only implemented for the Kubernetes orchestrator.
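To illustrate the run ID requirement, here is a minimal sketch. It is not the Kubernetes orchestrator's code: both classes and the `EXAMPLE_ORCHESTRATOR_RUN_ID` environment variable are hypothetical, and it assumes the ID is injected into the container once, from outside the process, so that a restarted container sees the same value.

```python
import os
import uuid

# Hypothetical variable name, assumed to be set once when the orchestration
# container is first scheduled and to persist across container restarts.
RUN_ID_ENV = "EXAMPLE_ORCHESTRATOR_RUN_ID"


class UnstableOrchestrator:
    """Counter-example: generates a fresh ID in every process."""

    def get_orchestrator_run_id(self) -> str:
        # A restarted container would get a different value here.
        return str(uuid.uuid4())


class StableOrchestrator:
    """Reads the ID from state that outlives the process."""

    def get_orchestrator_run_id(self) -> str:
        try:
            return os.environ[RUN_ID_ENV]
        except KeyError:
            raise RuntimeError(
                f"Expected {RUN_ID_ENV} to be set in the orchestration "
                "environment."
            )
```

With the stable variant, two separate instances (standing in for the process before and after a restart) return the same ID, which is what allows the restarted environment to find the existing pipeline run.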
Technical implementation details:
- When restarting the orchestration environment, we re-execute the pipeline function
- When a step function is executed, we first check if a step run for the given invocation ID already exists
- If a step run exists, we either return its results (in case it finished) or restart monitoring (in case it's still running). If the step is running in `inline` mode, we instead mark it as failed and potentially retry it.
- If no step run exists, we run it as usual
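The decision logic in the list above can be sketched roughly as follows. This is an illustrative model, not the code in this PR: `StepRun`, `Status`, `execute_step`, the dict-based `store`, the `monitor` callback, and the `allow_retry` flag are all hypothetical names. The handling of a step run that is already marked as failed is also an assumption, since the description does not cover that case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Optional


class Status(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class StepRun:
    status: Status
    inline: bool
    output: Any = None


def _run(store: Dict[str, StepRun], invocation_id: str,
         fn: Callable[[], Any], inline: bool) -> Any:
    # Register the step run, execute the step, and record its result.
    run = StepRun(status=Status.RUNNING, inline=inline)
    store[invocation_id] = run
    run.output = fn()
    run.status = Status.COMPLETED
    return run.output


def execute_step(
    store: Dict[str, StepRun],
    invocation_id: str,
    fn: Callable[[], Any],
    *,
    inline: bool = True,
    monitor: Optional[Callable[[StepRun], Any]] = None,
    allow_retry: bool = True,
) -> Any:
    existing = store.get(invocation_id)
    if existing is None:
        # No step run for this invocation ID: run it as usual.
        return _run(store, invocation_id, fn, inline)
    if existing.status is Status.COMPLETED:
        # Finished before the restart: reuse its results.
        return existing.output
    if existing.status is Status.RUNNING and not existing.inline:
        # Still running outside the orchestration environment: monitor again.
        if monitor is None:
            raise ValueError("A monitor callback is required to reattach.")
        return monitor(existing)
    # An inline step died together with the orchestration environment.
    # (Assumption: an already failed run is treated the same way.)
    existing.status = Status.FAILED
    if not allow_retry:
        raise RuntimeError(f"Step '{invocation_id}' failed during a restart.")
    return _run(store, invocation_id, fn, inline)
```

Because the pipeline function is re-executed from the top after a restart, every step call passes through this check again. Completed steps return immediately, so only interrupted or not-yet-started steps do any work.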
Pre-requisites
Please ensure you have done the following:
- [ ] I have read the CONTRIBUTING.md document.
- [ ] I have added tests to cover my changes.
- [ ] I have based my new branch on `develop` and the open PR is targeting `develop`. If your branch wasn't based on develop, read the Contribution guide on rebasing branch to develop.
- [ ] IMPORTANT: I made sure that my changes are reflected properly in the following resources:
- [ ] ZenML Docs
- [ ] Dashboard: Needs to be communicated to the frontend team.
- [ ] Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
- [ ] Projects: Depending on the version dependencies, different projects might get affected.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Other (add details above)