sagemaker-python-sdk
Cannot pass preprocessing script uri as a pipeline parameter
I want to pass the preprocessing script URI as a pipeline parameter and use it in a `ProcessingStep`, so that I can specify the script at execution time.
However, it throws `ValueError: code argument has to be a valid S3 URI or local file path rather than a pipeline variable`.
I think the check here evaluates to `True`: https://github.com/aws/sagemaker-python-sdk/blob/6d7bfc45be7883397cfff1ffbab9231a987c0f95/src/sagemaker/processing.py#L234
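For context, here is a minimal sketch of why that check trips (the parameter name and URI are hypothetical): at pipeline-definition time a `ParameterString` is only a placeholder, so `code` cannot be validated as a concrete S3 URI or local path.

```python
from sagemaker.workflow.parameters import ParameterString

# Hypothetical parameter mirroring the one in the issue.
script = ParameterString(
    name="PreprocessScript", default_value="s3://my-bucket/scripts/preprocess.py"
)

# The parameter is a placeholder object, not a plain str; it serializes to a
# reference that SageMaker resolves only at execution time.
print(script.expr)  # {'Get': 'Parameters.PreprocessScript'}
```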
Code tried:

```python
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)

# artifacts_bucket, prefix, file, role, and pipeline_name are defined earlier.
input_data = ParameterString(
    name="InputData", default_value=f"s3://{artifacts_bucket}/{prefix}/data/{file}"
)
preprocess_script = ParameterString(
    name="PreprocessScript",
    default_value=f"s3://{artifacts_bucket}/{prefix}/scripts/preprocess.py",
)

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.functions import Join
from sagemaker.workflow.execution_variables import ExecutionVariables

# Create the SKLearn processor object.
# The object contains information such as which instance type and IAM role to use.
# A managed processor comes with a preconfigured container, so only the
# framework version needs to be specified.
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1", role=role, instance_type="ml.t3.medium", instance_count=1
)

prefix = f"{prefix}/data"

print("Creating processing step")

# Use the sklearn_processor in a SageMaker Pipelines ProcessingStep
step_preprocess_data = ProcessingStep(
    name="Preprocess-Data",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
            destination=Join(
                on="/",
                values=[
                    f"s3://{artifacts_bucket}",
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "train",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/validation",
            destination=Join(
                on="/",
                values=[
                    f"s3://{artifacts_bucket}",
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "validation",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/test",
            destination=Join(
                on="/",
                values=[
                    f"s3://{artifacts_bucket}",
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "test",
                ],
            ),
        ),
    ],
    code=preprocess_script,  # passing a pipeline parameter here raises the ValueError
    job_arguments=["--train-val-test-split-ratio", "0.2", "--input-file", file],
)

from sagemaker.workflow.pipeline import Pipeline

pipeline_params = [
    input_data,
    preprocess_script,
]

pipeline = Pipeline(
    name=pipeline_name,
    parameters=pipeline_params,
    steps=[step_preprocess_data],
)

import json

print(json.dumps(json.loads(pipeline.definition()), indent=2))
```
Hi @nadasaiyed, thanks for reaching out!
Sorry, we don't support a parameterized `code` argument for `ProcessingStep` at this point, which is why you saw that error thrown.
The reason is that supporting a parameterized `code` requires more than changes to SageMaker Pipelines; the Processing Job service would also need to update its logic, which can take time.
I'll mark this issue as a feature request and get back to you once we have any updates.
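For anyone who lands here before the feature ships: one possible workaround is to keep `code` pointing at a small static launcher script and pass the parameterized script URI through an extra `ProcessingInput`, whose `source` does accept pipeline variables. A sketch, assuming the parameterized URI always ends in `preprocess.py` and reusing the variables from the issue:

```python
# launcher.py -- static entrypoint passed as `code`. It runs the script that
# the parameterized ProcessingInput downloaded into the container. Assumes the
# S3 URI held by the parameter always ends in "preprocess.py".
import subprocess
import sys

subprocess.run(
    [sys.executable, "/opt/ml/processing/input/script/preprocess.py"] + sys.argv[1:],
    check=True,
)
```

```python
# In the pipeline, the parameterized URI feeds an input; `code` stays static
# (outputs omitted for brevity).
step_preprocess_data = ProcessingStep(
    name="Preprocess-Data",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
        ProcessingInput(source=preprocess_script, destination="/opt/ml/processing/input/script"),
    ],
    code="launcher.py",  # local static file, so the validation passes
    job_arguments=["--train-val-test-split-ratio", "0.2", "--input-file", file],
)
```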
Hello @nadasaiyed, is this https://github.com/aws/sagemaker-python-sdk/issues/3069 what you were looking for?
Hi @nadasaiyed,
Thanks for using SageMaker and taking the time to suggest ways to improve the SageMaker Python SDK. We have added your feature request to our backlog and may consider it for future SDK versions. I will go ahead and close this issue now; please let me know if you have any more feedback or questions.
Best, Shweta