aws-step-functions-data-science-sdk-python

Training job name used inside the estimator differs from the job name passed as step input

Open yonghyeokrhee opened this issue 5 years ago • 0 comments

  • SageMaker container: conda_pytorch_p36
  • Estimator mode: script mode

The Step Functions Data Science SDK requires a TrainingJobName for the training step, so I pass one in through the execution input:

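import uuid

from sagemaker import s3_input
from sagemaker.pytorch import PyTorch
from stepfunctions import steps

# role, execution_input and workflow are assumed to be defined elsewhere.
# Script-mode PyTorch estimator; base_job_name is the prefix of the default job name.
pytorch_estimator = PyTorch(entry_point='HRC_0818_final.py',
                            train_instance_type='ml.m4.xlarge',
                            role=role,
                            train_instance_count=1,
                            framework_version='1.4.0',
                            base_job_name='kanto-base-job',
                            )
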
training_step = steps.TrainingStep(
    'Model Training',
    estimator=pytorch_estimator,
    data={
        'training': s3_input(s3_data=execution_input['TrainTargetLocation'])
    },
    job_name=execution_input['TrainingJobName'],
    wait_for_completion=True
)

model_step = steps.ModelStep(
    'Save model',
    model=training_step.get_expected_model(),
    model_name=execution_input['ModelName'],
    instance_type='ml.m4.xlarge',
)

execution = workflow.execute(
    inputs={
        # 'TrainingJobName' must also be supplied here; its value is not shown
        'ModelName': 'kanto-mode-{}'.format(uuid.uuid4().hex),
        'TrainTargetLocation': 's3://hrms-train/traindata/train.jsonl'
    }
)

Even so, the estimator still uses its default training job name internally, that is, base_job_name followed by the current timestamp:

"module_dir": "s3://sagemaker-{aws-region}-{aws-id}/{training-job-name}/source/sourcedir.tar.gz",

As a result, the Model gets linked to the wrong directory:

SAGEMAKER_SUBMIT_DIRECTORY | s3://sagemaker-{aws-region}-{aws-id}/{base-job-name}-2020-08-20-17-47-50-751/source/sourcedir.tar.gz

I guess the reason is that model.tar.gz and sourcedir.tar.gz end up in two different folders, which leads to awkward behavior: a consolidated model.tar.gz cannot be created when the model is deployed to the server. Only sourcedir.tar.gz gets copied to the MMS server, because it sits under the default job name, so model.pth is missing.
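The mismatch described above can be illustrated with a small standalone sketch. It only mimics the naming pattern seen in this report (base_job_name plus a millisecond timestamp) and does not call the SageMaker SDK. The bucket name, the step job name, and the output/model.tar.gz layout are assumptions made for illustration.

```python
from datetime import datetime


def default_job_name(base_job_name, now):
    """Mimic the default pattern seen above: base name plus a millisecond timestamp."""
    stamp = now.strftime("%Y-%m-%d-%H-%M-%S")
    return "{}-{}-{:03d}".format(base_job_name, stamp, now.microsecond // 1000)


def artifact_locations(bucket, step_job_name, estimator_job_name):
    """Return where each artifact would land under the two different job names.

    Assumption: the model archive is written under the job name given to the
    step, while the source archive stays under the estimator's default name.
    """
    return {
        "model": "s3://{}/{}/output/model.tar.gz".format(bucket, step_job_name),
        "source": "s3://{}/{}/source/sourcedir.tar.gz".format(bucket, estimator_job_name),
    }


if __name__ == "__main__":
    # Hypothetical values; the timestamp matches the one shown in this report.
    estimator_name = default_job_name(
        "kanto-base-job", datetime(2020, 8, 20, 17, 47, 50, 751000)
    )
    paths = artifact_locations(
        "sagemaker-us-east-1-123456789012", "kanto-training-job", estimator_name
    )
    for kind, uri in paths.items():
        print(kind, uri)
```

With the two names differing, the two archives share no common prefix, which is the situation the Model ends up pointing into.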

My workaround is to add a Lambda function that copies model.tar.gz (model.pth) from the TrainTargetLocation folder to the default training job folder (the timestamp-named one) so that deployment works correctly.
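A minimal sketch of such a copy Lambda is shown below. It is not the code actually used here. The event keys ModelArtifact and DefaultJobPrefix are hypothetical, and the optional s3_client argument exists only so the handler can be exercised without AWS access. The copy itself uses the boto3 S3 client's copy_object call.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def lambda_handler(event, context, s3_client=None):
    """Copy the trained model archive into the default training job folder.

    Hypothetical event keys:
      ModelArtifact    -- S3 URI of the existing model.tar.gz
      DefaultJobPrefix -- S3 URI of the timestamp-named job folder
    """
    if s3_client is None:
        import boto3  # imported lazily so the helpers work without boto3 installed
        s3_client = boto3.client("s3")

    src_bucket, src_key = split_s3_uri(event["ModelArtifact"])
    dst_bucket, dst_prefix = split_s3_uri(event["DefaultJobPrefix"])
    dst_key = dst_prefix.rstrip("/") + "/model.tar.gz"

    # Server-side copy; no data passes through the Lambda itself.
    s3_client.copy_object(
        Bucket=dst_bucket,
        Key=dst_key,
        CopySource={"Bucket": src_bucket, "Key": src_key},
    )
    return {"CopiedTo": "s3://{}/{}".format(dst_bucket, dst_key)}
```

In a workflow this would run as a Lambda step between training and model creation. A single copy_object call is limited to objects of up to 5 GB, so a larger archive would need a multipart copy instead.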

yonghyeokrhee, Aug 22 '20 03:08