sagemaker-python-sdk

Cannot mount EFS

Open murphycj opened this issue 3 years ago • 2 comments

Describe the bug

I am trying to run a training job with an EFS mount that contains my training data, but the job is unable to mount the EFS file system. I've double-checked that my training job and the EFS file system are in the same VPC.

Please see my code and error logs below. Thank you.

To reproduce

Here is a simplified version of the code that submits the job to SageMaker:

import os
import time
import sys

import sagemaker
import boto3

from sagemaker.tensorflow import TensorFlow
from sagemaker.inputs import FileSystemInput

sess = boto3.Session()
sm   = sess.client('sagemaker')
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=sess)

job_folder = 'jobs'
dataset_folder = 'datasets'

hyperparams = {}

bucket_name = sagemaker_session.default_bucket()
output_path = 's3://{}/jobs'.format(bucket_name)
job_name    = 'tensorflow-spot-{}'.format(time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))

tf_estimator = TensorFlow(
    entry_point='train.py',
    source_dir='code',
    output_path=f'{output_path}/',
    code_location=output_path,
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.1',
    py_version='py3',
    script_mode=True,
    # use_spot_instances=True,
    # max_wait = 7200,
    max_run=3600,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparams,
    subnets=['subnet-05e8dxxxxxxxxxxxx'],
    security_group_ids=['sg-01e85xxxxxxxxxxxxx'])

file_system_input = FileSystemInput(
    file_system_id='fs-3661cec0',
    file_system_type='EFS',
    directory_path='/tensorflow',
    file_system_access_mode='ro')

tf_estimator.fit(file_system_input, job_name=job_name, wait= True)
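The same-VPC check mentioned in the description could be scripted rather than done by eye. The following is a minimal sketch, not part of the original report: the helper names `mount_targets_outside_vpc` and `check_same_vpc` are hypothetical, and it assumes credentials allowed to call the EFS `describe_mount_targets` and EC2 `describe_subnets` APIs through boto3. It only compares VPC IDs and says nothing about security groups or whether the directory path exists.

```python
from typing import Iterable, List


def mount_targets_outside_vpc(mount_targets: Iterable[dict],
                              subnets: Iterable[dict]) -> List[str]:
    """Return IDs of mount targets whose VPC is not one of the job subnets' VPCs.

    Both arguments are lists of dicts shaped like the boto3 responses
    (each entry carries a "VpcId" key).
    """
    job_vpcs = {subnet["VpcId"] for subnet in subnets}
    return [
        mt["MountTargetId"]
        for mt in mount_targets
        if mt.get("VpcId") not in job_vpcs
    ]


def check_same_vpc(file_system_id: str, subnet_ids: List[str]) -> List[str]:
    """Hypothetical wrapper: fetch live data from AWS, then compare VPCs."""
    import boto3  # imported lazily so the pure helper above needs no AWS access

    efs = boto3.client("efs")
    ec2 = boto3.client("ec2")
    mount_targets = efs.describe_mount_targets(
        FileSystemId=file_system_id)["MountTargets"]
    subnets = ec2.describe_subnets(SubnetIds=subnet_ids)["Subnets"]
    return mount_targets_outside_vpc(mount_targets, subnets)
```

Calling `check_same_vpc` with the file system ID and the subnet list passed to the estimator would return an empty list when every mount target sits in one of the job's VPCs.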

Expected behavior

I expect the training job to be able to mount the EFS file system.

Screenshots or logs

2021-03-11 03:38:19 Starting - Starting the training job...
2021-03-11 03:38:43 Starting - Launching requested ML instancesProfilerReport-1615433898: InProgress
......
2021-03-11 03:39:43 Starting - Preparing the instances for training......
2021-03-11 03:40:48 Failed - Training job failed
..---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-22-abb915c4905d> in <module>
----> 1 tf_estimator.fit(file_system_input, job_name=job_name, wait= True)

~/Envs/tf/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    662         self.jobs.append(self.latest_training_job)
    663         if wait:
--> 664             self.latest_training_job.wait(logs=logs)
    665 
    666     def _compilation_job_name(self):

~/Envs/tf/lib/python3.7/site-packages/sagemaker/estimator.py in wait(self, logs)
   1589         # If logs are requested, call logs_for_jobs.
   1590         if logs != "None":
-> 1591             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1592         else:
   1593             self.sagemaker_session.wait_for_job(self.job_name)

~/Envs/tf/lib/python3.7/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3638 
   3639         if wait:
-> 3640             self._check_job_status(job_name, description, "TrainingJobStatus")
   3641             if dot:
   3642                 print()

~/Envs/tf/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3220                 ),
   3221                 allowed_statuses=["Completed", "Stopped"],
-> 3222                 actual_status=status,
   3223             )
   3224 

UnexpectedStatusException: Error for Training job tensorflow-spot-2021-03-11-03-38-17: Failed. Reason: ClientError: Unable to mount file system: fs-3661cec0, directory path: /tensorflow. No such file or directory: /tensorflow.
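The failure reason in the exception above names the file system, the directory path, and a detail string, and the detail suggests the path lookup failed rather than the network connection. As a rough sketch for pulling those fields apart when triaging several failed jobs, the snippet below parses the message with a regular expression. The names `parse_mount_failure` and `fetch_failure_reason` are hypothetical, and the pattern is an assumption based only on the single message shown here, so other SageMaker error formats may not match.

```python
import re
from typing import Optional

# Pattern inferred from the one error message in this report (an assumption).
_MOUNT_ERROR = re.compile(
    r"Unable to mount file system: (?P<file_system_id>fs-[0-9a-f]+), "
    r"directory path: (?P<directory_path>\S+?)\.\s+"
    r"(?P<detail>.+)"
)


def parse_mount_failure(reason: str) -> Optional[dict]:
    """Extract file system ID, directory path, and detail from a failure reason.

    Returns None when the text does not look like a mount failure.
    """
    match = _MOUNT_ERROR.search(reason)
    if match is None:
        return None
    fields = match.groupdict()
    fields["detail"] = fields["detail"].rstrip(".")
    return fields


def fetch_failure_reason(job_name: str) -> str:
    """Hypothetical helper: read FailureReason for a training job via boto3."""
    import boto3  # imported lazily so parsing works without AWS access

    sm = boto3.client("sagemaker")
    description = sm.describe_training_job(TrainingJobName=job_name)
    return description.get("FailureReason", "")
```

For the message in this report, the parsed `directory_path` is `/tensorflow`, the same value passed as `directory_path` to `FileSystemInput`.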

System information

A description of your system. Please provide:

  • SageMaker Python SDK version: 2.29.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): TensorFlow
  • Framework version: 2.1
  • Python version: 3.7.6
  • CPU or GPU: GPU (ml.p3.2xlarge)
  • Custom Docker image (Y/N): N

murphycj, Mar 11 '21 03:03