aws-step-functions-data-science-sdk-python icon indicating copy to clipboard operation
aws-step-functions-data-science-sdk-python copied to clipboard

Feature request: allowing for Retry to work with SageMaker steps

Open lucaboulard opened this issue 3 years ago • 5 comments

Currently, the Retry mechanism does not work with TrainingStep and ProcessingStep as the full job name must be specified to the step constructor so that if the step fails when the job has already been created, all retries will fail in submitting the job as the job name has already been used. This happens for almost any error (including capacity errors) excluding throttling errors.

A possible solution might be to add an alternative parameter to specify a job name prefix, instead of a full name, and let SageMaker add some random suffix.

lucaboulard avatar Jun 07 '21 13:06 lucaboulard

Interesting, I think that's a feature the Step Functions or SageMaker service needs to support. Step Functions will retry with the same parameters.

A workaround that could be done today is to catch errors, go to another step that creates a new job name, then go back to the TrainingStep which reads the JobName from StepInput. Crude ASCII diagram:

                 -----> [Actual next state if successful]
                /
[TrainingStep] - Catch -> [Step That Generates New Job Name]
       ^                                        /
        \____________________________________ /

Or perhaps RetryCount from the Context Object could be used with States.Format to create a new job name on each retry:

  • https://docs.aws.amazon.com/step-functions/latest/dg/input-output-contextobject.html
  • https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-intrinsic-functions.html

wong-a avatar Jun 17 '21 22:06 wong-a

Any update?

rodrick10 avatar Mar 04 '22 18:03 rodrick10

I found a workaround for this. You can override the job_name via the parameters and use fields from the context object to generate a unique name even after retrying by including the retry count.

training_step = steps.TrainingStep(
    "Train Step",
    estimator=xgb,
    data={
        "train": sagemaker.TrainingInput(train_s3_file, content_type="application/x-parquet"),
        "validation": sagemaker.TrainingInput(validation_s3_file, content_type="application/x-parquet"),
    },
    job_name=ExecutionInput()["dummy"],
    parameters = {
        "TrainingJobName.$": "States.Format('{}-{}-{}', $$.StateMachine.Name, $$.Execution.Name, $$.State.RetryCount)",
    },
    retry=default_retryer,
)

Please note how I set the job_name to ExecutionInput()["dummy"], because its a mandatory field. But it will be overwritten with the TrainingJobName from the parameters.

lasdem avatar Sep 01 '22 11:09 lasdem

Is this feature being implemented?

I am facing the same issue, although not related to retry. If we set a string to the job_name, the created SFN can only be executed once. i.e. If you access AWS SFN UI, you cannot execute the created workflow again. Because the training step here will use the same job name every time.

Which means we can only use ExecutionInput at the moment, but it is not user-friendly because it is not necessary for the user to input the job name manually.

keithleungwork avatar Oct 18 '23 08:10 keithleungwork

Is this feature being implemented?

I am facing the same issue, although not related to retry. If we set a string to the job_name, the created SFN can only be executed once. i.e. If you access AWS SFN UI, you cannot execute the created workflow again. Because the training step here will use the same job name every time.

Which means we can only use ExecutionInput at the moment, but it is not user-friendly because it is not necessary for the user to input the job name manually.

In my example above the ExecutionInput is not used, it is literally a dummy. The important part is the section below

parameters = {
        "TrainingJobName.$": "States.Format('{}-{}-{}', $$.StateMachine.Name, $$.Execution.Name, $$.State.RetryCount)",
    },

Because here the TrainingJobName will be overwritten with whats provided here, which includes the name of the step function, the execution id and the retry count. This will generate a new name for every execution.

lasdem avatar Oct 19 '23 09:10 lasdem