aws-step-functions-data-science-sdk-python
Different name for the training job inside the estimator than in the step input
- SageMaker container: conda_pytorch_p36
- Estimator mode: script mode
The Step Functions Data Science SDK requires me to supply a TrainingJobName for the TrainingStep, which I pass in through the execution input:
```python
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(
    entry_point='HRC_0818_final.py',
    train_instance_type='ml.m4.xlarge',
    role=role,  # SageMaker execution role ARN
    train_instance_count=1,
    framework_version='1.4.0',
    base_job_name='kanto-base-job',
)
```
```python
from stepfunctions import steps
from stepfunctions.inputs import ExecutionInput
from sagemaker.inputs import s3_input

# Execution-time placeholders (schema reconstructed from the keys used in this snippet)
execution_input = ExecutionInput(schema={
    'TrainingJobName': str,
    'ModelName': str,
    'TrainTargetLocation': str,
})

training_step = steps.TrainingStep(
    'Model Training',
    estimator=pytorch_estimator,
    data={'training': s3_input(s3_data=execution_input['TrainTargetLocation'])},
    job_name=execution_input['TrainingJobName'],
    wait_for_completion=True
)
```
```python
model_step = steps.ModelStep(
    'Save model',
    model=training_step.get_expected_model(),
    model_name=execution_input['ModelName'],
    instance_type='ml.m4.xlarge',
)
```
```python
import uuid

# workflow is the stepfunctions Workflow built from the steps above
execution = workflow.execute(
    inputs={
        # 'TrainingJobName' is required by the schema; this value is illustrative
        'TrainingJobName': 'kanto-train-{}'.format(uuid.uuid4().hex),
        'ModelName': 'kanto-mode-{}'.format(uuid.uuid4().hex),
        'TrainTargetLocation': 's3://hrms-train/traindata/train.jsonl'
    }
)
```
However, the training job name that ends up inside the estimator is still the default one, i.e. base_job_name followed by the current strftime timestamp. The module_dir hyperparameter takes the form

```
"module_dir": "s3://sagemaker-{aws-region}-{aws-id}/{training-job-name}/source/sourcedir.tar.gz"
```
so the ModelStep consequently links the wrong directory to the model:

```
SAGEMAKER_SUBMIT_DIRECTORY | s3://sagemaker-{aws-region}-{aws-id}/{base-job-name}-2020-08-20-17-47-50-751/source/sourcedir.tar.gz
```
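The mismatch is easy to confirm on a finished job by comparing the model artifact location with the submit-directory hyperparameter. A quick sketch (the job name here is a placeholder for whatever was passed as TrainingJobName):

```python
import boto3

sm = boto3.client('sagemaker')
# Placeholder name: use the value passed via the execution input
job = sm.describe_training_job(TrainingJobName='kanto-train-...')

# Where the model artifact actually landed
print(job['ModelArtifacts']['S3ModelArtifacts'])
# Where the script-mode source is expected: the {base-job-name}-{timestamp} prefix
print(job['HyperParameters'].get('sagemaker_submit_directory'))
```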
I guess the reason is that model.tar.gz and sourcedir.tar.gz end up in two different folders, which leads to the awkward behavior that you can't create a consolidated model.tar.gz when you deploy to a server. Only sourcedir.tar.gz gets copied to the MMS server, since that path uses the default job name, and model.pth is consequently missing.
So my workaround is a Lambda function that simply copies model.tar.gz (containing model.pth) from the TrainTargetLocation folder into the default, strftime-named training job folder, which makes it work correctly.
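For reference, that Lambda is essentially just an S3 copy. Here is a minimal sketch; the event keys and paths are illustrative assumptions, not my exact function:

```python
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Illustrative event contract:
    #   Bucket         - e.g. sagemaker-{aws-region}-{aws-id}
    #   SourceKey      - key of model.tar.gz produced by the training job
    #   DestinationKey - key under the default {base-job-name}-{timestamp} prefix
    bucket = event['Bucket']
    src_key = event['SourceKey']
    dst_key = event['DestinationKey']
    s3.copy_object(
        Bucket=bucket,
        CopySource={'Bucket': bucket, 'Key': src_key},
        Key=dst_key,
    )
    return {'copied_to': 's3://{}/{}'.format(bucket, dst_key)}
```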