
FrameworkProcessor is broken with SageMaker Pipelines

Open dgallitelli opened this issue 2 years ago • 21 comments

Describe the bug Using any Processor derived from FrameworkProcessor is broken with SageMaker Pipelines. There is a problem with the command and entrypoint parameters: command does not pass python3, so the Python entrypoint script is executed by the shell, causing the following error:

line 2: import: command not found

To reproduce

  1. Create a FrameworkProcessor (e.g. PyTorchProcessor, TensorFlowProcessor)
  2. Create a ProcessingStep and a Pipeline
  3. Execute it
  4. See it fail
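
For reference, a minimal reproduction sketch (the script name matches the logs below; framework, version, and instance settings are illustrative):

from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlowProcessor
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline import Pipeline

role = get_execution_role()

# Any FrameworkProcessor subclass (TensorFlowProcessor, PyTorchProcessor, ...) hits the same problem
tp = TensorFlowProcessor(
    framework_version='2.3',
    py_version='py37',
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

step_process = ProcessingStep(
    name='BrokenProcessingStep',
    processor=tp,
    code='inference_with_processing.py'  # a plain Python script, which ends up being executed by bash
)

pipeline = Pipeline(name='frameworkprocessor-repro', steps=[step_process])
pipeline.upsert(role_arn=role)
pipeline.start()  # the processing job fails with "import: command not found"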

Expected behavior The pipeline should run successfully.

Screenshots or logs

Screenshot from Pipelines: (image attached)

Logs from CloudWatch:

/opt/ml/processing/input/entrypoint/inference_with_processing.py: line 2: import: command not found
/opt/ml/processing/input/entrypoint/inference_with_processing.py: line 3: import: command not found
/opt/ml/processing/input/entrypoint/inference_with_processing.py: line 4: import: command not found
/opt/ml/processing/input/entrypoint/inference_with_processing.py: line 5: import: command not found
/opt/ml/processing/input/entrypoint/inference_with_processing.py: line 6: from: command not found

System information A description of your system. Please provide:

  • SageMaker Python SDK version: 2.57.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): Every framework
  • Framework version: Every version supported by SageMaker
  • Python version: 3.8
  • CPU or GPU: CPU and GPU
  • Custom Docker image (Y/N): N

Additional context N/A

dgallitelli avatar Sep 23 '21 15:09 dgallitelli

Thanks for raising this @dgallitelli

As discussed offline & detailed further in the linked PR, the integration between FrameworkProcessor and ProcessingStep is currently broken: the error you see is caused by the framework processor incorrectly trying to treat your processing script (Python) as the framework bootstrap script (shell).

We're actively working on a solution, but here are a couple of things you could try in the interim if you need a workaround:

  • If your job is a single script and you don't need requirements.txt dependencies, you could try using the ScriptProcessor instead and explicitly passing in the PyTorch/TensorFlow container URI.
  • Adding a shebang line such as #!/usr/bin/python3 (I'm not sure whether this is the actual installed location of Python in these containers) at the very top of your script file might persuade bash to run the script through the Python interpreter instead.

athewsey avatar Sep 23 '21 15:09 athewsey

UPDATE 1: Adding a shebang does not currently force the ProcessingStep to use python3 in the command.

dgallitelli avatar Sep 23 '21 15:09 dgallitelli

UPDATE 2: ScriptProcessor does work; however, it has no support for the source_dir parameter (as noted above by @athewsey). If you need custom dependencies or a multi-file script, create your own custom container by extending the SageMaker images for TF/PyTorch/HuggingFace/MXNet.

For those who need directions on how to switch from FrameworkProcessor to ScriptProcessor, here is an example for TensorFlow 2.3:

##### COMMENT THE TENSORFLOWPROCESSOR
 
# from sagemaker.tensorflow import TensorFlowProcessor
# tp = TensorFlowProcessor(
#     framework_version='2.3',
#     role = get_execution_role(),
#     instance_count=1,
#     instance_type='ml.m5.large',
#     base_job_name='DSM-TF-Demo-Process',
#     py_version='py37'
# )
 
 
##### AND REPLACE WITH
 
from sagemaker.image_uris import retrieve
from sagemaker.processing import ScriptProcessor
from sagemaker import get_execution_role
 
image_uri = retrieve(
    framework='tensorflow', 
    region='eu-west-1', 
    version='2.3', 
    py_version='py37', 
    image_scope='training',
    instance_type='ml.m5.xlarge'
)
sp = ScriptProcessor(
    role=get_execution_role(),
    image_uri=image_uri,
    command=['python3'],
    instance_count=1,
    instance_type='ml.m5.xlarge'
)
# Now, either run sp.run() or create a sagemaker.workflow.steps.ProcessingStep() , as needed
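
For example, wiring this ScriptProcessor into a pipeline step might look like the following (a minimal sketch; the step name and script are placeholders, and since ScriptProcessor has no source_dir the script must be self-contained):

from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name='MyScriptProcessorStep',
    processor=sp,
    code='processing.py',  # single, self-contained script
    inputs=[],             # add ProcessingInput(...) entries as needed
    outputs=[]             # add ProcessingOutput(...) entries as needed
)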

A very short example of a Dockerfile to extend the default TF container and install dependencies (not tested yet):

FROM 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.3-cpu-py37
COPY requirements.txt /opt/ml/processing/input/code/requirements.txt
RUN pip install -r /opt/ml/processing/input/code/requirements.txt
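
Once that image is built and pushed to an ECR repository in your account, you would point the ScriptProcessor at it instead of the retrieved framework URI (a sketch; the repository URI below is a placeholder):

from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor

# Hypothetical repository holding the image built from the Dockerfile above
custom_image_uri = '<account-id>.dkr.ecr.eu-west-1.amazonaws.com/my-tf-processing:latest'

sp_custom = ScriptProcessor(
    role=get_execution_role(),
    image_uri=custom_image_uri,
    command=['python3'],
    instance_count=1,
    instance_type='ml.m5.xlarge'
)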

dgallitelli avatar Sep 23 '21 16:09 dgallitelli

Any updates on this ?

khalidbourhaba avatar Mar 15 '22 14:03 khalidbourhaba

Is there an ETA on this fix?

darkreapyre avatar Mar 16 '22 15:03 darkreapyre

Is this issue fixed?

AakankshCTR avatar Apr 22 '22 13:04 AakankshCTR

This is still not fixed as of today (May 10th 2022).

dgallitelli avatar May 10 '22 07:05 dgallitelli

Any update on this issue? Facing the same problem.

morfaer avatar May 16 '22 14:05 morfaer

Still the case for now.

However, it is now possible to use the new sagemaker.workflow.pipeline_context.PipelineSession to have .run() generate the step arguments without actually running the Processing job. I tried it in a Jupyter Notebook with a custom FrameworkProcessor, but it should work with any FrameworkProcessor. Your code would look like:

from sagemaker import get_execution_role
from sagemaker.sklearn import SKLearn
from sagemaker.processing import FrameworkProcessor  # or use any other FrameworkProcessor subclass, e.g. HuggingFaceProcessor
from sagemaker.workflow.pipeline_context import PipelineSession

session = PipelineSession()

skpv2 = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version='0.23-1',
    role = get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    sagemaker_session = session
)

step_args = skpv2.run(
    code='processing.py',
    source_dir="code", # add processing.py and requirements.txt here
    inputs=[...], outputs=[...]
)

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

processing_step = ProcessingStep(
    name="MyProcessingStep",
    step_args=step_args
)

# [ define the other steps if any ]

pipeline = Pipeline(steps=[...])

Just make sure to update the SageMaker Python SDK to the latest version :)
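
To actually create and run it from there, a minimal continuation might look like this (the pipeline name is a placeholder):

pipeline = Pipeline(
    name='my-frameworkprocessor-pipeline',
    steps=[processing_step],
    sagemaker_session=session
)
pipeline.upsert(role_arn=get_execution_role())
execution = pipeline.start()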

dgallitelli avatar May 16 '22 15:05 dgallitelli

Thanks @dgallitelli

We would encourage users to adopt this new way to construct TrainingStep, ProcessingStep, TransformStep, TuningStep, and ModelStep.

We have a readthedocs page about to be released that introduces all the improvements we made to the SageMaker Python SDK Pipelines module.

jerrypeng7773 avatar May 16 '22 15:05 jerrypeng7773

The FrameworkProcessor has a method called get_run_args (doc here) that is designed to help integrate this processor with the ProcessingStep, which can then be put into a SageMaker pipeline. If you want to add pip dependencies, you can add a requirements.txt file under BASE_DIR.

Here is simplified code that connects the dots between FrameworkProcessor, get_run_args, ProcessingStep, and Pipeline.


import os

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    FrameworkProcessor
)

from sagemaker.workflow.functions import Join
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

from sagemaker.tensorflow import TensorFlow

# role, data_s3_bucket and data_s3_key are assumed to be defined elsewhere in the project
BASE_DIR = os.path.dirname(os.path.realpath(__file__))

preprocessing_processor = FrameworkProcessor(
    estimator_cls=TensorFlow,
    framework_version='2.4.3',
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    py_version='py37',
    command=["python3"],
    base_job_name="some-preprocessing-step"
)

train_data_in_s3 = ProcessingOutput(
    source="/opt/ml/processing/output/train/",
    destination=Join(
        on="/",
        values=[
            "s3:/",
            data_s3_bucket,
            os.environ["SAGEMAKER_PROJECT_NAME"],
            data_s3_key,
            'train/'
        ],
    ),
    output_name='train',
    s3_upload_mode='Continuous',
)

test_data_in_s3 = ProcessingOutput(
    source="/opt/ml/processing/output/test/",
    destination=Join(
        on="/",
        values=[
            "s3:/",
            data_s3_bucket,
            os.environ["SAGEMAKER_PROJECT_NAME"],
            data_s3_key,
            'test/'
        ],
    ),
    output_name='test',
    s3_upload_mode='Continuous',
)

data_s3_key_in_project = Join(
    on="/",
    values=[
        os.environ["SAGEMAKER_PROJECT_NAME"],
        data_s3_key
    ],
)

preprocessing_run_args = preprocessing_processor.get_run_args(
    code="preprocess.py",
    source_dir=BASE_DIR,
    inputs=[],
    outputs=[train_data_in_s3, test_data_in_s3],
    arguments=[
        '--data-s3-bucket', "your bucket name",
        '--data-s3-key', "your key"
    ]
)

preprocessing_step = ProcessingStep(
    name="your-preprocessing-step-name",
    processor=preprocessing_processor,
    inputs=preprocessing_run_args.inputs,
    outputs=preprocessing_run_args.outputs,
    job_arguments=preprocessing_run_args.arguments,
    code=preprocessing_run_args.code
)

pipeline_name = "your-pipeline-name"

distributed_ml_training_pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        # your pipeline parameters here
    ],
    steps=[preprocessing_step, ...]
)

If you are using this inside a SageMaker Studio MLOps Project, make sure to declare your requirements.txt inside a MANIFEST.in file to be shipped with the library: https://packaging.python.org/en/latest/guides/using-manifest-in/.

mohamed-ali avatar May 19 '22 10:05 mohamed-ali

Is there any running example for a ProcessingStep with PyTorch that allows source_dir?

marianokamp avatar May 21 '22 11:05 marianokamp

Thanks @dgallitelli

We would encourage users to adopt this new way to construct TrainingStep, ProcessingStep, TransformStep, TuningStep, and ModelStep.

We have a readthedocs page about to be released that introduces all the improvements we made to the SageMaker Python SDK Pipelines module.

Any update when this updated readthedocs will be released?

morfaer avatar Jun 07 '22 12:06 morfaer

@morfaer here we go https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html

jerrypeng7773 avatar Jun 07 '22 16:06 jerrypeng7773

I tried the ModelStep example from docs here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#model-step

Assuming that the Model object in this example is sagemaker.model.Model, the register method returns a ModelPackage | None type but the step_args argument in the ModelStep call expects an object of type _ModelStepArguments. So type-wise this example looks fishy to me.

Also this example creates the following error for me

  File "/home/.../run_pipeline.py", line 261, in model_registration_step
    register_model_step_args = model.register(
  File "/home/.../.venv/lib/python3.8/site-packages/sagemaker/workflow/pipeline_context.py", line 209, in wrapper
    return run_func(*args, **kwargs)
  File "/home/.../.venv/lib/python3.8/site-packages/sagemaker/model.py", line 373, in register
    model_package = self.sagemaker_session.create_model_package_from_containers(
AttributeError: 'NoneType' object has no attribute 'create_model_package_from_containers'

So it seems like this example does not work.

I'm using version 2.94.0 of the Sagemaker SDK from a local PC (not Sagemaker notebook) to start the Pipeline. Any ideas how this is supposed to work?

morfaer avatar Jun 09 '22 14:06 morfaer

(Quoting @morfaer's comment above.)

Can you please confirm that sagemaker.workflow.pipeline_context.PipelineSession() is used? More specifically:

from sagemaker.workflow.pipeline_context import PipelineSession
pipeline_session = PipelineSession()
....
model = Model(
    image_uri=pytorch_estimator.training_image_uri(),
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=role,
)
register_args = model.register(...)
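
From there, the captured register arguments go into a ModelStep rather than being executed directly; a short sketch (the step name is a placeholder):

from sagemaker.workflow.model_step import ModelStep

# Under a PipelineSession, model.register(...) only captures the request;
# the ModelStep is what actually runs it during pipeline execution.
step_model_registration = ModelStep(
    name="MyModelRegistration",
    step_args=register_args
)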

jerrypeng7773 avatar Jun 09 '22 17:06 jerrypeng7773

If you are using this inside a SageMaker Studio MLOps Project, make sure to declare your requirements.txt inside a MANIFEST.in file to be shipped with the library: https://packaging.python.org/en/latest/guides/using-manifest-in/.

Should MANIFEST.in be located in source_dir or somewhere else?

mstfldmr avatar Jun 09 '22 18:06 mstfldmr

If you are using this inside a SageMaker Studio MLOps Project, make sure to declare your requirements.txt inside a MANIFEST.in file to be shipped with the library: https://packaging.python.org/en/latest/guides/using-manifest-in/.

Should MANIFEST.in be located in source_dir or somewhere else?

It should be at the same level as the setup.py.

mohamed-ali avatar Jun 10 '22 07:06 mohamed-ali

@jerrypeng7773 The additional pipeline_session parameter worked for me. Thanks. What's weird though is that the ModelStep appends a RegisterModel suffix to my pipeline step in SageMaker Studio.

I guess the reason is the _append_register_model_step method in model_step.py in the workflow module.

The other step types don't seem to add such a suffix so this looks inconsistent to me. Is there a specific reason to do this for ModelStep?

morfaer avatar Jun 10 '22 09:06 morfaer

Hi @morfaer, both ModelStep and RegisterModel are StepCollections. A StepCollection is not an actual step/graph node that shows up in the pipeline graph; instead, it is a collection of steps/graph nodes. Both ModelStep and RegisterModel can contain the following list of steps:

  • _RepackModelStep (optional): this step is a TrainingStep under the hood, which does model repacking at execution time if certain conditions are met. Model repacking bundles the model artifacts with custom inference entry points and generates new repacked model artifacts for the registry.
  • _RegisterModelStep: the step that performs the model registration during execution.

What's weird though is that the ModelStep appends a RegisterModel suffix to my pipeline step

Compared with RegisterModel, ModelStep explicitly appends a suffix to each sub-step name:

  • "-RepackModel-" for _RepackModelStep
  • "-RegisterModel" for _RegisterModelStep

This gives users a clear hint about what each sub-step is doing.

In addition, we recently pushed this PR: https://github.com/aws/sagemaker-python-sdk/pull/3240 to apply ModelStep's naming convention to RegisterModel as well, because RegisterModel's previous naming scheme could cause some issues.

qidewenwhen avatar Jul 15 '22 01:07 qidewenwhen

(Quoting @dgallitelli's PipelineSession workaround from May 16 above.)

This method works... the .run() call under a PipelineSession produces step arguments that the ProcessingStep accepts in the step_args parameter.

The method proposed by mohamed-ali did not work for me... it creates a list of arguments for the job_arguments parameter.

It may be worth noting that each method works through a different parameter. When using the step_args parameter, you cannot also use the processor argument, because step_args already supplies the processor for the ProcessingStep.

When using the job_arguments parameter, there were no conflicts with the processor parameter, but the job still failed citing the missing 'code' without any reference to the missing 'source_dir'... this means it may have solved the source directory issue, but there was still an issue retrieving the .py file from it.

From this experience, it appears that using the step_args method via the PipelineSession is a good idea.

pjbhaumik avatar Jul 21 '22 19:07 pjbhaumik

Is there any update on this issue, has it been fixed? Currently running into the same problem

MetelStairs avatar Oct 10 '22 12:10 MetelStairs

Is there any update on this issue, has it been fixed? Currently running into the same problem

Yes, using a PipelineSession() will work... follow the example dgallitelli posted on May 16 above.

pjbhaumik avatar Oct 10 '22 13:10 pjbhaumik