
FrameworkProcessor doesn't install packages in requirements.txt if it's in a Sagemaker Project

Open · mstfldmr opened this issue 3 years ago

Describe the bug
I created a pipeline with a single ProcessingStep based on FrameworkProcessor. When I upsert() and start() the pipeline from a notebook, it runs fine.

The code lives in a SageMaker project generated from an AWS-provided template. When I push a change to CodeCommit, the processing job fails because of a missing package: although the package is listed in source_dir/requirements.txt, it is not installed.

Traceback (most recent call last):
  File "preprocess.py", line 7, in <module>
    import sagemaker
ModuleNotFoundError: No module named 'sagemaker'
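(Not part of the original report.) One thing worth ruling out first is the source_dir layout: as far as I can tell, FrameworkProcessor only picks up a requirements.txt that sits at the top level of source_dir, next to the entry script. A minimal local sanity check, with hypothetical paths mirroring the pipeline below:

```python
import os
import tempfile

# Hypothetical layout mirroring the pipeline's source_dir ("preprocessing/")
base = tempfile.mkdtemp()
source_dir = os.path.join(base, "preprocessing")
os.makedirs(source_dir)

# requirements.txt must live at the root of source_dir, not in a subfolder
with open(os.path.join(source_dir, "requirements.txt"), "w") as f:
    f.write("sagemaker\n")
with open(os.path.join(source_dir, "preprocess.py"), "w") as f:
    f.write("import sagemaker\n")

# Sanity check: both files are at the top level of source_dir
assert os.path.isfile(os.path.join(source_dir, "requirements.txt"))
assert os.path.isfile(os.path.join(source_dir, "preprocess.py"))
print("layout ok")
```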

To reproduce

import os

import sagemaker
import sagemaker.session
import sagemaker.sklearn.estimator
from sagemaker.processing import FrameworkProcessor, ProcessingOutput
from sagemaker.workflow.parameters import ParameterInteger, ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

BASE_DIR = os.path.dirname(os.path.realpath(__file__))

# get_session() is defined elsewhere in the project template


def get_pipeline(
    region,
    sagemaker_project_arn=None,
    role=None,
    default_bucket=None,
    model_package_group_name="MstfPackageGroup",
    pipeline_name="MstfPipeline",
    base_job_prefix="Mstf",
    feature_group_name="ninja-x5-y3-feature-group-02-14-30-03",
    from_date="2021-05-13",
    to_date="2022-05-13"
):

    sagemaker_session = get_session(region, default_bucket)
    s3_default_bucket = sagemaker_session.default_bucket()
    if role is None:
        role = sagemaker.session.get_execution_role(sagemaker_session)

    # parameters for pipeline execution
    processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)
    processing_instance_type = ParameterString(
        name="ProcessingInstanceType", default_value="ml.m5.xlarge"
    )
    training_instance_type = ParameterString(
        name="TrainingInstanceType", default_value="ml.m5.xlarge"
    )
    model_approval_status = ParameterString(
        name="ModelApprovalStatus", default_value="PendingManualApproval"
    )
    
    feature_group_name_input = ParameterString(
        name="InputFeatureGroupName",
        default_value=feature_group_name
    )
    
    input_s3_bucket = ParameterString(
        name="DefaultS3Bucket",
        default_value=f"s3://{s3_default_bucket}/{base_job_prefix}"
    )
    
    input_from_date = ParameterString(
        name="DataFromDate",
        default_value=from_date
    )
    
    input_to_date = ParameterString(
        name="DataToDate",
        default_value=to_date
    )
    
    est_cls = sagemaker.sklearn.estimator.SKLearn
    framework_version_str = "0.23-1"

    script_processor = FrameworkProcessor(
        estimator_cls=est_cls,
        framework_version=framework_version_str,
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=sagemaker_session
    )
    
    processor_run_args = script_processor.get_run_args(
        code="preprocess.py",
        source_dir=os.path.join(BASE_DIR, "preprocessing"),
        inputs=[],
        outputs=[
            ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
            ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
        ],
        arguments=[
                    "--feature_group_name", feature_group_name_input,
                    "--input_s3_bucket", input_s3_bucket,
                    "--from_date", input_from_date,
                    "--to_date", input_to_date],
    )
    
    step_process = ProcessingStep(
        name="PreprocessMstfData",
        processor=script_processor,
        inputs=processor_run_args.inputs,
        outputs=processor_run_args.outputs,
        job_arguments=processor_run_args.arguments,
        code=processor_run_args.code,
    )


    # pipeline instance
    pipeline = Pipeline(
        name=pipeline_name,
        parameters=[
            processing_instance_type,
            processing_instance_count,
            training_instance_type,
            model_approval_status,
            feature_group_name_input,
            input_s3_bucket,
            input_from_date,
            input_to_date
        ],
        steps=[step_process],
        sagemaker_session=sagemaker_session,
    )
    return pipeline

Expected behavior
Install the sagemaker package, because it is listed in requirements.txt, and run preprocess.py successfully.
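For context, FrameworkProcessor's generated bootstrap script is supposed to do roughly the equivalent of the following before invoking the entry point (a simplified sketch, not the SDK's actual code; the function name is mine):

```python
import os
import subprocess
import sys
import tempfile


def install_requirements(source_dir: str) -> bool:
    """Simplified sketch of the dependency-install step FrameworkProcessor's
    generated bootstrap script performs before running the entry point.
    Returns True if a requirements.txt was found and installed."""
    req = os.path.join(source_dir, "requirements.txt")
    if not os.path.isfile(req):
        # Nothing to install; the entry point runs with the base image only,
        # which is exactly the failure mode reported in this issue.
        return False
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", req])
    return True


# With no requirements.txt present, the step is a no-op:
empty_dir = tempfile.mkdtemp()
print(install_requirements(empty_dir))  # → False
```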

System information

  • SageMaker Python SDK version: 2.86.2
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): SKLearn
  • Framework version: 0.23-1
  • Python version: 3.7
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N


mstfldmr · Jun 09 '22 18:06

Is your CodeCommit build the one running the script? If so, I would recommend modifying your build.yaml to add a setup step that installs all requirements before running your script.

m3et · Jun 15 '22 07:06

@mstfldmr do you still have this issue after doing what @m3et suggested, or should we close it?

davidbrochart · Sep 21 '23 14:09

Closing this issue for now; feel free to reopen if it still persists. Thank you.

akrishna1995 · Dec 26 '23 20:12

It worked after adding an empty __init__.py file.
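Not stated in the comment itself, but the fix above amounts to marking the code directory as a regular Python package. A minimal sketch (the preprocessing/ name mirrors the pipeline's source_dir, and placing __init__.py inside it is my assumption):

```python
import os
import tempfile

# Hypothetical project layout mirroring the pipeline's source_dir;
# placing __init__.py inside source_dir is an assumption based on the comment.
base = tempfile.mkdtemp()
source_dir = os.path.join(base, "preprocessing")
os.makedirs(source_dir)

for name in ("preprocess.py", "requirements.txt"):
    open(os.path.join(source_dir, name), "w").close()

# The reported fix: an empty __init__.py marking the folder as a package
open(os.path.join(source_dir, "__init__.py"), "w").close()

print(sorted(os.listdir(source_dir)))
# → ['__init__.py', 'preprocess.py', 'requirements.txt']
```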

MstfGatherin avatar Dec 27 '23 15:12 MstfGatherin