
ScriptProcessor does not check local_code config before uploading code to S3

Open · lodo1995 opened this issue 2 years ago • 4 comments

Describe the bug

When a LocalSession or LocalPipelineSession is configured to use local code, as follows:

session.config = {'local': {'local_code': True}}

the code passed to a pipeline ProcessingStep, or directly to the run method of a processor (ScriptProcessor, FrameworkProcessor, ...), should not be uploaded to S3.

However, ScriptProcessor does not honor this. Its _include_code_in_inputs method (called by _normalize_args on the base class Processor, which runs both when the processor is invoked directly and when it runs through a pipeline) unconditionally uploads the code to S3. https://github.com/aws/sagemaker-python-sdk/blob/554952eac259979dc714a1a9002653ced342b876/src/sagemaker/processing.py#L625

Compare this to the Model class, used for example by TrainingStep. Its _upload_code method checks the session configuration and skips the S3 upload when local code is enabled. https://github.com/aws/sagemaker-python-sdk/blob/554952eac259979dc714a1a9002653ced342b876/src/sagemaker/model.py#L532
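
For illustration, a minimal sketch (not the actual SDK code; get_config_value is the helper Model uses) of how the same guard might look inside ScriptProcessor before it uploads anything:

from sagemaker.utils import get_config_value

# Excerpt-style sketch of a guard inside ScriptProcessor._include_code_in_inputs,
# mirroring the check Model._upload_code already performs.
local_code = get_config_value('local.local_code', self.sagemaker_session.config)
if self.sagemaker_session.local_mode and local_code:
    code_url = code  # keep the local path, skip the S3 upload
else:
    code_url = self._upload_code(code, kms_key)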

To reproduce

In the absence of any AWS credentials (which should not be needed when running completely locally), the following code fails while trying to upload the processing.py script to S3 (botocore.exceptions.NoCredentialsError). Note that, in addition to the code below, a processing.py file must exist in the working directory (its contents don't matter).

Code
import boto3
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.workflow.steps import ProcessingStep

role = 'arn:aws:iam::123456789012:role/MyRole'

# local_code should keep the script on the local filesystem instead of S3
local_pipeline_session = LocalPipelineSession(boto_session = boto3.Session(region_name = 'eu-west-1'))
local_pipeline_session.config = {'local': {'local_code': True}}

script_processor = ScriptProcessor(
    image_uri = 'docker.io/library/python:3.8',
    command = ['python'],
    instance_type = 'local',
    instance_count = 1,
    sagemaker_session = local_pipeline_session,
    role = role,
)

processing_step = ProcessingStep(
    name = 'Processing Step',
    processor = script_processor,
    code = 'processing.py',
    inputs = [
        ProcessingInput(
            source = './input-data',
            destination = '/opt/ml/processing/input',
        )
    ],
    outputs = [
        ProcessingOutput(
            source = '/opt/ml/processing/output',
            destination = './output-data',
        )
    ],
)

pipeline = Pipeline(
    name = 'MyPipeline',
    steps = [processing_step],
    sagemaker_session = local_pipeline_session
)

# Fails with botocore.exceptions.NoCredentialsError: despite local_code,
# the SDK tries to upload processing.py to S3.
pipeline.upsert(role_arn = role)

pipeline_run = pipeline.start()
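
Until the SDK checks this itself, a possible stopgap is to subclass ScriptProcessor and apply the same guard. This is an untested sketch: the private _upload_code signature may differ between SDK versions, and whether the local job runner accepts a plain local path here is an assumption.

from sagemaker.processing import ScriptProcessor
from sagemaker.utils import get_config_value

class LocalCodeScriptProcessor(ScriptProcessor):
    # Hypothetical workaround: skip the S3 upload when running fully locally.
    def _upload_code(self, code, kms_key=None):
        local_code = get_config_value('local.local_code', self.sagemaker_session.config)
        if self.sagemaker_session.local_mode and local_code:
            return code  # hand the local path through unchanged
        return super()._upload_code(code, kms_key)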

System information

  • SageMaker Python SDK version: 2.126.0

lodo1995 avatar Dec 27 '22 16:12 lodo1995

@lodo1995 Any developments? Is local development actually possible at the moment?

clausagerskov avatar Feb 20 '23 14:02 clausagerskov

@clausagerskov In general, local development is partially possible: some things work, while others (such as the one described in this bug) don't. Your mileage may vary.

Regarding this specific bug, as far as I can tell no AWS developer has even looked at it, nor at any of the other bugs I opened. I don't have time to take care of all of this, so I decided to just avoid SageMaker for the time being.

lodo1995 avatar Feb 20 '23 14:02 lodo1995

Following this as well.

https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-local-mode.html

Although the documentation says local mode is supported, it still requires the user to upload the input code to S3, as described in this issue. If the upload is forced as a side effect, why is it required at all when local mode is meant to run the pipeline with local resources?

Adamwgoh avatar Jan 09 '24 08:01 Adamwgoh

I wonder if it is possible to emulate an S3 location without having to pay for some tool.

clausagerskov avatar Feb 08 '24 16:02 clausagerskov
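
For what it's worth, free S3-compatible servers exist (MinIO, for example, or the moto library for tests); a minimal sketch of pointing boto3 at one running locally, where the endpoint and credentials are placeholders:

import boto3

# Placeholder endpoint/credentials for a locally running, S3-compatible
# server such as MinIO (free and open source).
s3 = boto3.client(
    's3',
    endpoint_url = 'http://localhost:9000',
    aws_access_key_id = 'minioadmin',
    aws_secret_access_key = 'minioadmin',
)
s3.create_bucket(Bucket = 'sagemaker-local')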