Cannot pass preprocessing script uri as a pipeline parameter

Open nadasaiyed opened this issue 2 years ago • 1 comments

I want to pass the preprocessing script URI as a pipeline parameter and use it in ProcessingStep so that I can specify the script at execution time. However, doing so raises ValueError: code argument has to be a valid S3 URI or local file path rather than a pipeline variable

I think it evaluates to True here: https://github.com/aws/sagemaker-python-sdk/blob/6d7bfc45be7883397cfff1ffbab9231a987c0f95/src/sagemaker/processing.py#L234
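For readers who do not want to follow the link, the kind of guard being described can be sketched roughly as below. This is a simplified stand-in, not the SDK's actual source: the classes `PipelineVariable` and `ParameterString` here are toy versions written for illustration, and `normalize_code_arg` is a hypothetical helper name. Only the error message is taken from the report above.

```python
# Simplified stand-ins, NOT the SDK's real classes. They only mimic the idea
# that a pipeline parameter is a placeholder object, not a concrete string.
class PipelineVariable:
    """Toy base class for values that are resolved at pipeline execution time."""


class ParameterString(PipelineVariable):
    def __init__(self, name, default_value=None):
        self.name = name
        self.default_value = default_value


def is_pipeline_variable(value):
    """Return True when the value is a placeholder rather than a plain string."""
    return isinstance(value, PipelineVariable)


def normalize_code_arg(code):
    """Hypothetical helper: reject a parameterized `code`, pass plain strings through."""
    if code and is_pipeline_variable(code):
        raise ValueError(
            "code argument has to be a valid S3 URI or local file path "
            "rather than a pipeline variable"
        )
    return code
```

With this sketch, a plain string such as an S3 URI passes through unchanged, while a `ParameterString` instance trips the check regardless of its default value, which matches the behaviour reported here.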

Code tried:

from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat
)

input_data = ParameterString(name="InputData", default_value=f"s3://{artifacts_bucket}/{prefix}/data/{file}")

preprocess_script = ParameterString(
    name="PreprocessScript", default_value=f"s3://{artifacts_bucket}/{prefix}/scripts/preprocess.py"
)

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.functions import Join
from sagemaker.workflow.execution_variables import ExecutionVariables

# Create SKlearn processor object,
# The object contains information about what instance type to use, the IAM role to use etc.
# A managed processor comes with a preconfigured container, so only specifying version is required.
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1", role=role, instance_type="ml.t3.medium", instance_count=1
)

prefix = f"{prefix}/data"
print("Creating processing step")
# Use the sklearn_processor in a SageMaker Pipelines ProcessingStep
step_preprocess_data = ProcessingStep(
    name="Preprocess-Data",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
            destination=Join(
                on="/",
                values=[
                    f"s3://{artifacts_bucket}",
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "train",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/validation",
            destination=Join(
                on="/",
                values=[
                    f"s3://{artifacts_bucket}",
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "validation",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/test",
            destination=Join(
                on="/",
                values=[
                    f"s3://{artifacts_bucket}",
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "test",
                ],
            ),
        ),
    ],
    code=preprocess_script,
    job_arguments=["--train-val-test-split-ratio", "0.2", "--input-file", file]
)

from sagemaker.workflow.pipeline import Pipeline

pipeline_params = [
    input_data,
    preprocess_script,
]

pipeline = Pipeline(
    name=pipeline_name,
    parameters=pipeline_params,
    steps=[step_preprocess_data]
)

import json
print(json.dumps(json.loads(pipeline.definition()), indent=2))

nadasaiyed, May 31 '22 00:05

Hi @nadasaiyed, thanks for reaching out! Sorry, we don't currently allow parameterized code for ProcessingStep, which is why you saw this error. Supporting parameterized code requires more than changes to SageMaker Pipelines: Processing Jobs also need to update their logic, which can take time. I'll mark this issue as a feature request and get back to you once we have any updates.

qidewenwhen, Jun 01 '22 20:06
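One possible workaround, not confirmed anywhere in this thread, is to keep `code` pointing at a fixed launcher script and deliver the real preprocessing script through an extra ProcessingInput whose source is the pipeline parameter, the same way `input_data` is passed in the original post. The launcher then runs whatever script was downloaded. The sketch below shows only the launcher side, using the standard library. The mount directory `/opt/ml/processing/user_code` and the function names are assumptions made for illustration, and it also assumes the input directory ends up containing exactly one `.py` file.

```python
import runpy
import sys
from pathlib import Path

# Hypothetical mount point: the destination an extra ProcessingInput would use.
SCRIPT_DIR = Path("/opt/ml/processing/user_code")


def find_script(script_dir):
    """Return the single .py file in script_dir, or raise if there is not exactly one."""
    candidates = sorted(Path(script_dir).glob("*.py"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one .py file in {script_dir}, found {len(candidates)}"
        )
    return candidates[0]


def run_script(args, script_dir=SCRIPT_DIR):
    """Run the downloaded script as __main__, forwarding the job arguments to it."""
    script = find_script(script_dir)
    old_argv = sys.argv
    sys.argv = [str(script), *args]
    try:
        # run_path executes the file and returns its resulting module globals.
        return runpy.run_path(str(script), run_name="__main__")
    finally:
        # Restore argv so the launcher's own state is unchanged afterwards.
        sys.argv = old_argv
```

Inside the container, the launcher would end with a call such as `run_script(sys.argv[1:])`, so that the values given in `job_arguments` reach the real script unchanged. Because the script travels as an ordinary input, this sidesteps the check on `code` rather than changing it.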

Hello @nadasaiyed, is this https://github.com/aws/sagemaker-python-sdk/issues/3069 what you were looking for?

bandhakavi, Sep 27 '22 15:09

Hi @nadasaiyed,

Thanks for using SageMaker and taking the time to suggest ways to improve the SageMaker Python SDK. We have added your feature request to our backlog and may consider it for a future SDK version. I will go ahead and close the issue now. Please let me know if you have any more feedback or other questions.

Best, Shweta

ShwetaSingh801, Dec 22 '23 08:12