sagemaker-python-sdk
Cannot mount EFS
Describe the bug
I am trying to run a training job with an EFS mount that contains my training data, but the job fails to mount the file system. I have double-checked that the training job and the EFS file system are in the same VPC.
Please see my code and error logs below. Thank you.
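For reference, the same-VPC check mentioned above could be scripted roughly as follows. This is only a sketch: the helper names `mount_targets_in_vpc` and `check_efs_reachability` are made up for illustration, and it assumes the caller has credentials that allow `ec2:DescribeSubnets` and `elasticfilesystem:DescribeMountTargets`. It compares the VPC of the training subnet with the VPC reported for each EFS mount target.

```python
def mount_targets_in_vpc(mount_targets, vpc_id):
    """Return the mount targets whose reported VpcId matches vpc_id.

    mount_targets is the "MountTargets" list from the EFS
    describe_mount_targets response.
    """
    return [mt for mt in mount_targets if mt.get("VpcId") == vpc_id]


def check_efs_reachability(file_system_id, subnet_id, region_name=None):
    """Hypothetical helper: list the EFS mount targets that share a VPC
    with the subnet passed to the estimator."""
    # Imported lazily so the pure helper above can be used without boto3.
    import boto3

    session = boto3.Session(region_name=region_name)
    ec2 = session.client("ec2")
    efs = session.client("efs")

    # Look up the VPC that the training subnet belongs to.
    subnet = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]
    # List every mount target of the file system.
    targets = efs.describe_mount_targets(FileSystemId=file_system_id)["MountTargets"]
    return mount_targets_in_vpc(targets, subnet["VpcId"])
```

An empty result from `check_efs_reachability` would mean that no mount target shares a VPC with the subnet. A non-empty result only confirms the VPC match; it says nothing about security group rules or whether the directory path exists.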
To reproduce
Here is a simplified version of the code that submits the job to SageMaker:
import os
import time
import sys
import sagemaker
import boto3
from sagemaker.tensorflow import TensorFlow
from sagemaker.inputs import FileSystemInput
sess = boto3.Session()
sm = sess.client('sagemaker')
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=sess)
job_folder = 'jobs'
dataset_folder = 'datasets'
hyperparams = {}
bucket_name = sagemaker_session.default_bucket()
output_path = 's3://{}/jobs'.format(bucket_name)
job_name = 'tensorflow-spot-{}'.format(time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
tf_estimator = TensorFlow(
    entry_point='train.py',
    source_dir='code',
    output_path=f'{output_path}/',
    code_location=output_path,
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.1',
    py_version='py3',
    script_mode=True,
    # use_spot_instances=True,
    # max_wait=7200,
    max_run=3600,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparams,
    subnets=['subnet-05e8dxxxxxxxxxxxx'],
    security_group_ids=['sg-01e85xxxxxxxxxxxxx'])
file_system_input = FileSystemInput(
    file_system_id='fs-3661cec0',
    file_system_type='EFS',
    directory_path='/tensorflow',
    file_system_access_mode='ro')
tf_estimator.fit(file_system_input, job_name=job_name, wait=True)
Expected behavior
I expect the training job to mount the EFS file system successfully.
Screenshots or logs
2021-03-11 03:38:19 Starting - Starting the training job...
2021-03-11 03:38:43 Starting - Launching requested ML instancesProfilerReport-1615433898: InProgress
......
2021-03-11 03:39:43 Starting - Preparing the instances for training......
2021-03-11 03:40:48 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-22-abb915c4905d> in <module>
----> 1 tf_estimator.fit(file_system_input, job_name=job_name, wait= True)
~/Envs/tf/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
662 self.jobs.append(self.latest_training_job)
663 if wait:
--> 664 self.latest_training_job.wait(logs=logs)
665
666 def _compilation_job_name(self):
~/Envs/tf/lib/python3.7/site-packages/sagemaker/estimator.py in wait(self, logs)
1589 # If logs are requested, call logs_for_jobs.
1590 if logs != "None":
-> 1591 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1592 else:
1593 self.sagemaker_session.wait_for_job(self.job_name)
~/Envs/tf/lib/python3.7/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3638
3639 if wait:
-> 3640 self._check_job_status(job_name, description, "TrainingJobStatus")
3641 if dot:
3642 print()
~/Envs/tf/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3220 ),
3221 allowed_statuses=["Completed", "Stopped"],
-> 3222 actual_status=status,
3223 )
3224
UnexpectedStatusException: Error for Training job tensorflow-spot-2021-03-11-03-38-17: Failed. Reason: ClientError: Unable to mount file system: fs-3661cec0, directory path: /tensorflow. No such file or directory: /tensorflow.
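The failure reason above ends with "No such file or directory: /tensorflow", which suggests the `directory_path` was not found on the file system, though I have not confirmed that. One way to check would be from a machine that already has the same EFS mounted. The sketch below assumes such a mount exists; the helper name `efs_directory_exists` and the mount point `/mnt/efs` are hypothetical and not part of the SageMaker SDK.

```python
import os


def efs_directory_exists(mount_point, directory_path):
    """Check whether directory_path (given relative to the EFS root,
    e.g. '/tensorflow') exists under a local mount of that file system."""
    # Strip the leading slash so the path is joined under the mount point
    # instead of being treated as an absolute path on the local machine.
    relative = directory_path.lstrip("/")
    return os.path.isdir(os.path.join(mount_point, relative))


if __name__ == "__main__":
    # '/mnt/efs' is an assumed mount point; adjust it to the real one.
    print(efs_directory_exists("/mnt/efs", "/tensorflow"))
```

If this prints `False`, the directory is not at that path relative to the file system root, for example because the data sits under a different parent directory.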
System information
- SageMaker Python SDK version: 2.29.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
- Framework version: 2.1
- Python version: 3.7.6
- CPU or GPU: GPU (ml.p3.2xlarge)
- Custom Docker image (Y/N): N