aws-step-functions-data-science-sdk-python
Feature request: allowing for Retry to work with SageMaker steps
Currently, the Retry mechanism does not work with TrainingStep and ProcessingStep. The full job name must be passed to the step constructor, so if the step fails after the job has already been created, every retry fails when it submits the job, because the job name is already in use.
This happens for almost any error (including capacity errors) excluding throttling errors.
A possible solution might be to add an alternative parameter to specify a job name prefix, instead of a full name, and let SageMaker add some random suffix.
Interesting, I think that's a feature the Step Functions or SageMaker service needs to support. Step Functions will retry with the same parameters.
A workaround that could be done today is to catch errors, go to another step that creates a new job name, then go back to the TrainingStep which reads the JobName from StepInput. Crude ASCII diagram:
                       -----> [Actual next state if successful]
                      /
[TrainingStep] - Catch -> [Step That Generates New Job Name]
      ^                                  /
       \________________________________/
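The loop in the diagram could be sketched as an Amazon States Language definition, built here as a plain Python dict. This is only an illustration under several assumptions: the state names, the `BaseJobName` and `Attempt` input fields, and the `MAX_ATTEMPTS` bound are hypothetical, the job name is regenerated with an attempt counter via `States.MathAdd` rather than a random suffix, and the remaining required training job parameters are left out.

```python
import json

MAX_ATTEMPTS = 3  # hypothetical bound so the catch loop cannot run forever

definition = {
    "StartAt": "TrainingStep",
    "States": {
        "TrainingStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                # The job name is rebuilt from the state input on every pass.
                "TrainingJobName.$": "States.Format('{}-{}', $.BaseJobName, $.Attempt)",
                # AlgorithmSpecification, RoleArn, etc. are omitted in this sketch.
            },
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.Error",  # keep the original input next to the error
                    "Next": "GenerateNewJobName",
                }
            ],
            "Next": "ActualNextState",
        },
        "GenerateNewJobName": {
            "Type": "Pass",
            "Parameters": {
                "BaseJobName.$": "$.BaseJobName",
                "Attempt.$": "States.MathAdd($.Attempt, 1)",
            },
            "Next": "CheckAttempts",
        },
        "CheckAttempts": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.Attempt",
                    "NumericLessThan": MAX_ATTEMPTS,
                    "Next": "TrainingStep",
                }
            ],
            "Default": "GiveUp",
        },
        "GiveUp": {"Type": "Fail", "Error": "TrainingFailed"},
        "ActualNextState": {"Type": "Succeed"},
    },
}

print(json.dumps(definition, indent=2))
```

An execution of this sketch would be started with input such as `{"BaseJobName": "my-job", "Attempt": 0}`.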
Or perhaps RetryCount from the Context Object could be used with States.Format to create a new job name on each retry:
- https://docs.aws.amazon.com/step-functions/latest/dg/input-output-contextobject.html
- https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-intrinsic-functions.html
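To see what names this idea would produce, here is a rough local stand-in for States.Format in plain Python. The helper names `states_format` and `job_names` are made up for this illustration, and it only mimics simple `{}` substitution, not the full intrinsic. It assumes RetryCount is 0 on the first attempt and increases by one on each retry.

```python
def states_format(template: str, *args) -> str:
    """Rough local stand-in for States.Format: fill each '{}' left to right."""
    parts = template.split("{}")
    if len(parts) - 1 != len(args):
        raise ValueError("number of '{}' placeholders must match the arguments")
    out = parts[0]
    for arg, tail in zip(args, parts[1:]):
        out += str(arg) + tail
    return out


def job_names(base: str, max_retries: int) -> list:
    """Names produced for the first attempt (retry count 0) and each retry."""
    return [states_format("{}-{}", base, retry) for retry in range(max_retries + 1)]


print(job_names("my-training-job", 2))
# ['my-training-job-0', 'my-training-job-1', 'my-training-job-2']
```

Each attempt gets a distinct name, so a retry no longer collides with the job created by the failed attempt.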
Any update?
I found a workaround for this. You can override the job_name via the parameters and use fields from the context object to generate a unique name even after retrying by including the retry count.
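    import sagemaker
    from stepfunctions import steps
    from stepfunctions.inputs import ExecutionInput

    # xgb, train_s3_file, validation_s3_file and default_retryer are defined elsewhere.
    training_step = steps.TrainingStep(
        "Train Step",
        estimator=xgb,
        data={
            "train": sagemaker.TrainingInput(train_s3_file, content_type="application/x-parquet"),
            "validation": sagemaker.TrainingInput(validation_s3_file, content_type="application/x-parquet"),
        },
        # job_name is mandatory, so pass a placeholder; it is overridden below.
        job_name=ExecutionInput()["dummy"],
        parameters={
            # Unique per state machine, execution and retry attempt.
            "TrainingJobName.$": "States.Format('{}-{}-{}', $$.StateMachine.Name, $$.Execution.Name, $$.State.RetryCount)",
        },
        retry=default_retryer,
    )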
Please note that I set job_name to ExecutionInput()["dummy"] because it is a mandatory field. It is overwritten by the TrainingJobName from the parameters.
Is this feature being implemented?
I am facing the same issue, although it is not related to retries.
If we pass a fixed string as job_name, the created state machine can only be executed once. That is, in the AWS Step Functions console you cannot execute the created workflow a second time, because the training step uses the same job name every time.
This means we can only use ExecutionInput at the moment, which is not user-friendly: the user should not have to enter the job name manually.
In my example above the ExecutionInput is not actually used; it is literally a dummy. The important part is this section:
    parameters={
        "TrainingJobName.$": "States.Format('{}-{}-{}', $$.StateMachine.Name, $$.Execution.Name, $$.State.RetryCount)",
    },
Here the TrainingJobName is overwritten with the formatted value, which combines the name of the state machine, the execution name and the retry count. This generates a new name for every execution and for every retry.
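One caveat that may be worth checking with this approach: SageMaker training job names are limited to 63 characters and may only contain alphanumerics and hyphens, while state machine and execution names can be longer and may contain other characters such as underscores. The following sketch checks a combined name locally. The helper name `is_valid_job_name` and the sample names are made up for illustration, and the sample execution name merely imitates a UUID-style name.

```python
import re

# Pattern and length limit as documented for the SageMaker TrainingJobName field.
_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")
_MAX_JOB_NAME_LENGTH = 63


def is_valid_job_name(name: str) -> bool:
    """Return True if the name fits the SageMaker training job name constraints."""
    return (
        len(name) <= _MAX_JOB_NAME_LENGTH
        and _JOB_NAME_PATTERN.fullmatch(name) is not None
    )


def combined_name(state_machine: str, execution: str, retry_count: int) -> str:
    """Mirror States.Format('{}-{}-{}', ...) from the workaround above."""
    return f"{state_machine}-{execution}-{retry_count}"


execution = "0f8fad5b-d9cb-469f-a165-70867728950e"  # UUID-style execution name

for machine in ("MyStateMachine", "my_state_machine", "a" * 30):
    name = combined_name(machine, execution, 0)
    print(len(name), is_valid_job_name(name))
```

If the combined name turns out to be too long or contains other characters, a shorter state machine name or an explicitly chosen execution name would be one way around it.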