sagemaker-tensorflow-training-toolkit
Add Instructions for Contributing to the project
I am trying to get to the bottom of a problem (#413) that causes my deployed TensorFlow model to fail.
The model is simple and deploys with basic instructions to GCP ML Engine (MLE). The serving function that errors out on SageMaker works fine on MLE.
The problem seems to be in the way the SageMaker container processes the input.
As such, I have started to debug locally, but I am guessing at how to do that properly and am currently unsure how the local SageMaker container assumes the role passed to the TensorFlow constructor.
Currently, I am building the latest sagemaker-tensorflow-container image at v1.10.0 and calling it from a local notebook instance using the MNIST example provided by amazon-sagemaker-examples:
import os

from sagemaker.tensorflow import TensorFlow

# `role` is the SageMaker execution role ARN defined earlier in the notebook.
mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             framework_version='1.10.0',
                             training_steps=10,
                             evaluation_steps=10,
                             train_instance_count=2,
                             train_instance_type='local',
                             image_name='my-sm-tensorflow:1.10.0-cpu-py2')

# mnist_estimator.fit(inputs)
local_inputs = 'file://{}/data/'.format(os.getcwd())
mnist_estimator.fit(local_inputs)
However, the local container fails because it cannot get an object from S3:
INFO:sagemaker:Creating training-job with name: my-sm-tensorflow-2018-10-08-05-34-16-185
Creating tmp6pytpo_algo-2-GGF0S_1 ...
Creating tmp6pytpo_algo-1-GGF0S_1 ...
Attaching to tmp6pytpo_algo-1-GGF0S_1, tmp6pytpo_algo-2-GGF0S_1
algo-1-GGF0S_1 | 2018-10-08 05:34:25,817 INFO - root - running container entrypoint
algo-1-GGF0S_1 | 2018-10-08 05:34:25,818 INFO - root - starting train task
algo-1-GGF0S_1 | 2018-10-08 05:34:25,841 INFO - container_support.training - Training starting
algo-2-GGF0S_1 | 2018-10-08 05:34:26,845 INFO - root - running container entrypoint
algo-2-GGF0S_1 | 2018-10-08 05:34:26,846 INFO - root - starting train task
algo-2-GGF0S_1 | 2018-10-08 05:34:26,873 INFO - container_support.training - Training starting
algo-1-GGF0S_1 | 2018-10-08 05:34:26,974 INFO - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
algo-1-GGF0S_1 | Downloading s3://sagemaker-ap-southeast-2-167464700695/my-sm-tensorflow-2018-10-08-05-34-16-185/source/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-GGF0S_1 | 2018-10-08 05:34:27,433 ERROR - container_support.training - uncaught exception during training: An error occurred (403) when calling the HeadObject operation: Forbidden
algo-1-GGF0S_1 | Traceback (most recent call last):
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 36, in start
algo-1-GGF0S_1 | fw.train()
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/tf_container/train_entry_point.py", line 140, in train
algo-1-GGF0S_1 | env.download_user_module()
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/container_support/environment.py", line 89, in download_user_module
algo-1-GGF0S_1 | cs.download_s3_resource(self.user_script_archive, tmp)
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/container_support/utils.py", line 41, in download_s3_resource
algo-1-GGF0S_1 | script_bucket.download_file(script_key_name, target)
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 246, in bucket_download_file
algo-1-GGF0S_1 | ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 172, in download_file
algo-1-GGF0S_1 | extra_args=ExtraArgs, callback=Callback)
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 307, in download_file
algo-1-GGF0S_1 | future.result()
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 73, in result
algo-1-GGF0S_1 | return self._coordinator.result()
algo-1-GGF0S_1 | File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 233, in result
algo-1-GGF0S_1 | raise self._exception
algo-1-GGF0S_1 | ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
algo-1-GGF0S_1 |
algo-1-GGF0S_1 |
tmp6pytpo_algo-1-GGF0S_1 exited with code 1
Stopping tmp6pytpo_algo-2-GGF0S_1 ...
Aborting on container exit... ... done
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-7-9d694d5f5d5b> in <module>()
4 # try local inputs
5 local_inputs = 'file://{}/data/'.format(os.getcwd())
----> 6 mnist_estimator.fit(local_inputs)
/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/tensorflow/estimator.pyc in fit(self, inputs, wait, logs, job_name, run_tensorboard_locally)
248 tensorboard.join()
249 else:
--> 250 fit_super()
251
252 @classmethod
/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/tensorflow/estimator.pyc in fit_super()
230 """
231 def fit_super():
--> 232 super(TensorFlow, self).fit(inputs, wait, logs, job_name)
233
234 if run_tensorboard_locally and wait is False:
/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
190 self._prepare_for_training(job_name=job_name)
191
--> 192 self.latest_training_job = _TrainingJob.start_new(self, inputs)
193 if wait:
194 self.latest_training_job.wait(logs=logs)
/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/estimator.pyc in start_new(cls, estimator, inputs)
432 resource_config=config['resource_config'], vpc_config=config['vpc_config'],
433 hyperparameters=hyperparameters, stop_condition=config['stop_condition'],
--> 434 tags=estimator.tags)
435
436 return cls(estimator.sagemaker_session, estimator._current_job_name)
/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/session.pyc in train(self, image, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags)
277 LOGGER.info('Creating training-job with name: {}'.format(job_name))
278 LOGGER.debug('train request: {}'.format(json.dumps(train_request, indent=4)))
--> 279 self.sagemaker_client.create_training_job(**train_request)
280
281 def tune(self, job_name, strategy, objective_type, objective_metric_name,
/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/local_session.pyc in create_training_job(self, TrainingJobName, AlgorithmSpecification, InputDataConfig, OutputDataConfig, ResourceConfig, **kwargs)
73 training_job = _LocalTrainingJob(container)
74 hyperparameters = kwargs['HyperParameters'] if 'HyperParameters' in kwargs else {}
---> 75 training_job.start(InputDataConfig, hyperparameters)
76
77 LocalSagemakerClient._training_jobs[TrainingJobName] = training_job
/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/entities.pyc in start(self, input_data_config, hyperparameters)
58 self.state = self._TRAINING
59
---> 60 self.model_artifacts = self.container.train(input_data_config, hyperparameters)
61 self.end = datetime.datetime.now()
62 self.state = self._COMPLETED
/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/image.pyc in train(self, input_data_config, hyperparameters)
124 # which contains the exit code and append the command line to it.
125 msg = "Failed to run: %s, %s" % (compose_command, str(e))
--> 126 raise RuntimeError(msg)
127
128 s3_artifacts = self.retrieve_artifacts(compose_data)
RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/1x/gyr4jt_s3jqc2c88vy74btnm0000gn/T/tmp6PyTpo/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
I can verify that the role is able to copy the object, which leads me to suppose that the container does not assume the role properly.
How is the container meant to assume the role?
The instructions to build the container image locally are clear, thank you for that. I would like to see something in the README.md or CONTRIBUTING.md that shows the recommended process for developing the container and calling the built image locally.
Do you have docker-compose installed?
I believe the AmazonSageMakerFullAccess policy has, by default, an S3 condition that requires the bucket name to contain the word "sagemaker".
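If it helps to verify that, here is a minimal sketch (assuming the caller is allowed to read AWS managed policies with iam:GetPolicy and iam:GetPolicyVersion) that prints the S3 statements of that policy so you can see exactly which bucket names it covers:

import json
import boto3

# Hedged sketch: fetch the AmazonSageMakerFullAccess managed policy document
# and print only the statements that mention S3 actions, to confirm the
# bucket-name restriction.
iam = boto3.client('iam')
policy_arn = 'arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'

version_id = iam.get_policy(PolicyArn=policy_arn)['Policy']['DefaultVersionId']
document = iam.get_policy_version(PolicyArn=policy_arn,
                                  VersionId=version_id)['PolicyVersion']['Document']

for statement in document['Statement']:
    actions = statement.get('Action', [])
    actions = actions if isinstance(actions, list) else [actions]
    if any(action.startswith('s3:') for action in actions):
        print(json.dumps(statement, indent=2))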
In addition, for local mode, I believe that since you have your AWS credentials set, they should be passed to the container properly.
https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L631
Could you perhaps try exporting your credentials as environment variables?
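For example, something like this minimal sketch (assuming your default boto3 session resolves the credentials you want the container to use) would export them before calling fit():

import os
import boto3

# Hedged sketch: export the credentials of the current boto3 session as
# environment variables so that local mode can hand them to the container.
credentials = boto3.Session().get_credentials().get_frozen_credentials()

os.environ['AWS_ACCESS_KEY_ID'] = credentials.access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = credentials.secret_key
if credentials.token:  # only present for temporary (assumed-role) credentials
    os.environ['AWS_SESSION_TOKEN'] = credentials.token

mnist_estimator.fit(local_inputs)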
Hi @ChoiByungWook, I appreciate your help.
The problem is that, inside the container, the ExecutionRole passed to the TensorFlow constructor is not being assumed. Let me convince you.
When debugging these issues I assume the role from my local machine and run under that role.
You can see from the container log above that I explicitly built the Docker image with a credentials file for a user that is able to assume the ExecutionRole:
algo-1-GGF0S_1 | 2018-10-08 05:34:26,974 INFO - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
After that point, the HeadObject API call fails (403):
algo-1-GGF0S_1 | ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
Now I use the very same credentials file to authenticate and assume the intended role.
Observe the following, executed with the awscli from my own machine:
(general) tim@tim:.aws ❯ aws sts get-caller-identity
{
"UserId": "AROAIL7MXNDJIECXXGR34:botocore-session-1539129598",
"Account": "167464700695",
"Arn": "arn:aws:sts::167464700695:assumed-role/AmazonSageMaker-ExecutionRole-20180907T092630/botocore-session-1539129598"
}
(general) tim@tim:.aws ❯ aws s3api head-object --bucket sagemaker-ap-southeast-2-167464700695 --key tims-sm-tensorflow-2018-10-08-05-34-16-185/source/sourcedir.tar.gz
{
"AcceptRanges": "bytes",
"LastModified": "Mon, 08 Oct 2018 05:34:20 GMT",
"ContentLength": 1495,
"ETag": "\"609b922764cac00005bbe2d6dfa17475\"",
"ContentType": "binary/octet-stream",
"Metadata": {}
}
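The same check can be reproduced with boto3; here is a minimal sketch (the role ARN is inferred from the get-caller-identity output above, so adjust the path if the role actually lives under /service-role/):

import boto3

# Hedged sketch: assume the execution role explicitly, then issue the same
# HeadObject call the container failed on.
sts = boto3.client('sts')
assumed = sts.assume_role(
    RoleArn='arn:aws:iam::167464700695:role/AmazonSageMaker-ExecutionRole-20180907T092630',
    RoleSessionName='head-object-check')['Credentials']

s3 = boto3.client('s3',
                  region_name='ap-southeast-2',
                  aws_access_key_id=assumed['AccessKeyId'],
                  aws_secret_access_key=assumed['SecretAccessKey'],
                  aws_session_token=assumed['SessionToken'])

# This succeeds from my machine, which is why I believe the role's permissions are fine.
print(s3.head_object(Bucket='sagemaker-ap-southeast-2-167464700695',
                     Key='tims-sm-tensorflow-2018-10-08-05-34-16-185/source/sourcedir.tar.gz'))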
I hope this makes it clear that the role has permission to call HeadObject.
I am sure that I am simply not running the local container properly. Perhaps there are requirements about the notebook environment that instantiates the TensorFlow object and about the role it has; I suspect we will discover that in time. This is why I would like to see the official recommendations for how to develop the container locally.
Once again, thanks for your help. I think Sagemaker has fantastic potential and am quite keen to contribute.
I agree that it's confusing that the role passed to the TensorFlow estimator is not actually used in the containers with local mode. As mentioned in aws/sagemaker-python-sdk#413, we will update our documentation.
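In the meantime, a quick way to see which identity a local-mode container actually runs under is to print the caller identity from inside the entry point (a minimal sketch; drop it near the top of mnist.py):

import boto3

# Hedged sketch: in local mode this typically reports the host credentials
# passed through as environment variables, not the execution role given to
# the estimator.
print(boto3.client('sts').get_caller_identity())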