aws-step-functions-data-science-sdk-python
chore: Add retry to pipeline templates constructors to add retrier to each pipeline step
Description
Fix build failures caused by SageMaker ThrottlingException when running the pipeline integration tests.
Fixes #(issue) - N/A
Why is the change necessary?
Recent build failures were caused by SageMaker ThrottlingException (Rate exceeded) during the following tests:
- test_training_pipeline_estimators.py::test_pca_estimator
- test_training_pipeline_framework_estimator.py::test_torch_training_pipeline
- test_inference_pipeline.py::test_inference_pipeline_framework
Solution
Add an optional retry argument to the pipeline template constructors (InferencePipeline and TrainingPipeline) so that a retry strategy can be attached to the pipeline steps. The same retrier is added to every step.
Caveat: This fix applies one retry strategy to all steps in the pipeline. The customer cannot customize the strategy for individual steps.
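The behavior described above can be sketched without the SDK. The classes below (RetrySketch, StepSketch, PipelineSketch) are hypothetical stand-ins invented for illustration, not the SDK's Retry, step, or pipeline classes; the step names are taken from the TrainingPipeline example in this description. This is a minimal sketch of "one optional retrier, applied to every step", not the actual implementation in this PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RetrySketch:
    """Hypothetical stand-in for a retry strategy (not the SDK's Retry)."""
    error_equals: Tuple[str, ...]
    interval_seconds: int = 5
    max_attempts: int = 5
    backoff_rate: float = 2.0


@dataclass
class StepSketch:
    """Hypothetical stand-in for a pipeline step that can hold retriers."""
    name: str
    retriers: List[RetrySketch] = field(default_factory=list)

    def add_retry(self, retry: RetrySketch) -> None:
        self.retriers.append(retry)


class PipelineSketch:
    """Hypothetical pipeline template with an optional `retry` argument."""

    STEP_NAMES = ("training_step", "model_step",
                  "endpoint_config_step", "deploy_step")

    def __init__(self, retry: Optional[RetrySketch] = None):
        self.steps = [StepSketch(name) for name in self.STEP_NAMES]
        # When a retry strategy is given, the same object is attached to
        # every step; when it is omitted, the steps are left unchanged.
        if retry is not None:
            for step in self.steps:
                step.add_retry(retry)
```

Because the argument is optional and defaults to None, existing callers that do not pass it would see no change in the generated steps.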
Alternate solution 1:
We could let the client customize the retry strategy for each pipeline step by accepting a dict, in addition to accepting a Retry object.
Caveat: The retry strategy dict keys must correspond exactly to the step variable names. A validation step could be added to warn the customer of any unrecognized keys.
For example:
retry_strategy_per_step = {
    'training_step': <training_retry_strategy>,
    'model_step': <model_retry_strategy>,
    'endpoint_config_step': <endpoint_config_retry_strategy>,
    'deploy_step': <deploy_retry_strategy>
}
If a dict is received, only add retriers to the steps that have a strategy defined in that dict.
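One way the dict handling and key validation could look is sketched below. The function name resolve_retriers and the KNOWN_STEPS tuple are assumptions made for illustration; the strategies are treated as opaque objects, so nothing here depends on the SDK. It returns, per step name, the list of retriers that would be attached.

```python
import warnings
from typing import Any, Dict, List, Sequence

# Step variable names from the TrainingPipeline example above.
KNOWN_STEPS = ("training_step", "model_step",
               "endpoint_config_step", "deploy_step")


def resolve_retriers(retry: Any,
                     known_steps: Sequence[str] = KNOWN_STEPS
                     ) -> Dict[str, List[Any]]:
    """Map each step name to the retriers it should receive.

    `retry` may be None (no retriers), a dict keyed by step name
    (retriers only for the listed steps), or a single strategy object
    (the same retrier for every step).
    """
    if retry is None:
        return {name: [] for name in known_steps}

    if isinstance(retry, dict):
        # Warn about keys that do not match any step variable name.
        unknown = sorted(set(retry) - set(known_steps))
        if unknown:
            warnings.warn(
                f"Ignoring retry strategies for unrecognized steps: {unknown}"
            )
        return {
            name: [retry[name]] if name in retry else []
            for name in known_steps
        }

    # Any other object is treated as one strategy shared by all steps.
    return {name: [retry] for name in known_steps}
```

Warning rather than raising keeps a misspelled key from failing pipeline creation, at the cost of the affected step silently running without a retrier; raising a ValueError would be the stricter choice.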
Alternate solution 2:
Add retries only in the integration tests, by updating the pipeline workflow with the added retriers:
from stepfunctions.steps import Chain, Retry

# Once the pipeline is created, do something like:
sagemaker_retry_strategy = Retry(
    error_equals=["SageMaker.AmazonSageMakerException"],
    interval_seconds=5,
    max_attempts=5,
    backoff_rate=2
)
# Attach the same retrier to every step, then rebuild the workflow definition.
steps = pipeline.workflow.definition.branch.steps
for step in steps:
    step.add_retry(sagemaker_retry_strategy)
pipeline.workflow.update(Chain(steps))
Caveat: If the fix is applied only to the integration tests, customers who want retry strategies on the pipeline steps will have to repeat this each time they create a pipeline.
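To make the retry parameters above concrete, the sketch below computes the wait before each retry, assuming standard Amazon States Language semantics: IntervalSeconds is the wait before the first retry, BackoffRate multiplies the wait after each attempt, and MaxAttempts caps the number of retries. The helper retry_delays is hypothetical and ignores optional settings such as a maximum delay or jitter.

```python
from typing import List


def retry_delays(interval_seconds: float,
                 max_attempts: int,
                 backoff_rate: float) -> List[float]:
    """Return the wait (in seconds) before each retry attempt."""
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]


delays = retry_delays(interval_seconds=5, max_attempts=5, backoff_rate=2)
print(delays)       # [5, 10, 20, 40, 80]
print(sum(delays))  # 155
```

Under these assumptions, a step that keeps being throttled would be retried five times over roughly two and a half minutes of cumulative waiting before the error is surfaced.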
Testing
- Updated integ test and added unit test
- Generated doc locally
Pull Request Checklist
Please check all boxes (including N/A items)
Testing
- [X] Unit tests added
- [X] Integration test added
- [X] Manual testing - why was it necessary? could it be automated? - N/A
Documentation
- [X] docs: All relevant docs updated
- [X] docstrings: All public APIs documented
Title and description
- [X] Change type: Title is prefixed with change type: and follows conventional commits
- [X] References: Indicate issues fixed via: Fixes #xxx - N/A
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license.
This is a feature, not a chore. Adding new functionality should be motivated from the customer's POV, not just to fix the tests. That being said, I see some value here. Ideally, we should've had retries added by default. But it's still possible to add retriers by updating the Chain directly, right? Do we have any open issues related to this?
It would be nice to add a preconfigured retry strategy like you defined in the tests. It's not uncommon for SDKs to have default and reusable retry strategies. Customers using the pipeline classes probably don't want to deal with much of the lower level ASL constructs.
AWS CodeBuild CI Report
- CodeBuild project: AutoBuildProject6AEA49D1-sEHrOdk7acJc
- Commit ID: aea996cbe762a24d549734bd90089993a1020d7b
- Result: SUCCEEDED
- Build Logs (available for 30 days)