sagemaker-training-toolkit
Unable to run a TensorFlow Estimator.
I am unable to run a TensorFlow estimator in local mode. I get the error shown in the output below.
Please advise!
Thank you,
Nektarios
My code:
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(entry_point="Testing.py",
                       role='SageMakerRole',
                       train_instance_count=1,
                       train_instance_type='local_gpu',
                       framework_version='2.2.0',
                       py_version='py37')
estimator.fit()
My Testing.py:
from pathlib import Path
import pandas as pd
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.layers import LSTM, Bidirectional, Dropout, Dense, AdditiveAttention, Concatenate, TimeDistributed, Permute
from tensorflow.keras.models import Model
from tensorflow.keras.metrics import RootMeanSquaredError, BinaryAccuracy
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.callbacks import Callback, ModelCheckpoint
import glob
import shutil
import os
from tqdm import tqdm
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from utils.common import next_week_day, merge_files
print("Hi")
"C:\Program Files\Python37\python.exe" C:/Development/Projects/DeepTradingAI/deeptradingmodels/DeepTradingEstimator.py
C:\Program Files\Python37\lib\site-packages\sagemaker\local\local_session.py:429: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
self.config = yaml.load(open(sagemaker_config_file, "r"))
Windows Support for Local Mode is Experimental
Legacy mode is deprecated in versions 1.13 and higher. Using script mode instead. Legacy mode and its training parameters will be deprecated in SageMaker Python SDK v2. Please use TF 1.13 or higher and script mode.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Creating tmpvy3d0546_algo-1-sk9mf_1 ... error
ERROR: for tmpvy3d0546_algo-1-sk9mf_1 Cannot create container for service algo-1-sk9mf: Unknown runtime specified nvidia
ERROR: for algo-1-sk9mf Cannot create container for service algo-1-sk9mf: Unknown runtime specified nvidia
Encountered errors while bringing up the project.
Failed to delete: C:\Users\NEKTAR~1\AppData\Local\Temp\tmpvy3d0546\algo-1-sk9mf
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 161, in train
_stream_output(process)
File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 677, in _stream_output
raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 166, in train
raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', 'C:\\Users\\NEKTAR~1\\AppData\\Local\\Temp\\tmpvy3d0546\\docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Development/Projects/DeepTradingAI/deeptradingmodels/DeepTradingEstimator.py", line 11, in <module>
estimator.fit()
File "C:\Program Files\Python37\lib\site-packages\sagemaker\tensorflow\estimator.py", line 483, in fit
fit_super()
File "C:\Program Files\Python37\lib\site-packages\sagemaker\tensorflow\estimator.py", line 462, in fit_super
super(TensorFlow, self).fit(inputs, wait, logs, job_name, experiment_config)
File "C:\Program Files\Python37\lib\site-packages\sagemaker\estimator.py", line 494, in fit
self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
File "C:\Program Files\Python37\lib\site-packages\sagemaker\estimator.py", line 1066, in start_new
estimator.sagemaker_session.train(**train_args)
File "C:\Program Files\Python37\lib\site-packages\sagemaker\session.py", line 590, in train
self.sagemaker_client.create_training_job(**train_request)
File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\local_session.py", line 102, in create_training_job
training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\entities.py", line 96, in start
input_data_config, output_data_config, hyperparameters, job_name
File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 174, in train
self._cleanup(dirs_to_delete)
File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 597, in _cleanup
_delete_tree(container_config_path)
File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 723, in _delete_tree
shutil.rmtree(path)
File "C:\Program Files\Python37\lib\shutil.py", line 516, in rmtree
return _rmtree_unsafe(path, onerror)
File "C:\Program Files\Python37\lib\shutil.py", line 395, in _rmtree_unsafe
_rmtree_unsafe(fullname, onerror)
File "C:\Program Files\Python37\lib\shutil.py", line 404, in _rmtree_unsafe
onerror(os.rmdir, path, sys.exc_info())
File "C:\Program Files\Python37\lib\shutil.py", line 402, in _rmtree_unsafe
os.rmdir(path)
OSError: [WinError 145] The directory is not empty: 'C:\\Users\\NEKTAR~1\\AppData\\Local\\Temp\\tmpvy3d0546\\algo-1-sk9mf\\output'
@nectario what is your system setup? are you on a GPU instance?
Yes, I am using a GPU on my local machine. It's an Nvidia Titan Volta. Python version: 3.7. TensorFlow version: 2.2.0.
ERROR: for algo-1-sk9mf Cannot create container for service algo-1-sk9mf: Unknown runtime specified nvidia
I've usually seen this error when a GPU image is run on a CPU instance, but since you're on a GPU machine, I don't have any immediate ideas. Do you have nvidia-docker2 installed? Can you try restarting the Docker daemon (as per this GitHub issue with a similar error message)? Are you able to run other GPU Docker images?
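One quick way to check whether the Docker daemon knows about the `nvidia` runtime is to look at the runtimes it reports. The sketch below is only an illustration, not part of the SageMaker SDK: it assumes the `docker` CLI is on the PATH, and the helper names (`docker_runtimes_json`, `has_nvidia_runtime`, `pick_local_instance_type`) are hypothetical. If no `nvidia` runtime is registered, `local_gpu` will likely keep failing with "Unknown runtime specified nvidia", while plain `local` (CPU) should not need that runtime.

```python
import json
import subprocess


def docker_runtimes_json() -> str:
    """Return the JSON map of runtimes registered with the local Docker daemon.

    Returns "{}" if the docker CLI is missing, times out, or reports an error.
    """
    try:
        result = subprocess.run(
            ["docker", "info", "--format", "{{json .Runtimes}}"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # text mode; works on Python 3.7
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "{}"
    return result.stdout.strip() if result.returncode == 0 else "{}"


def has_nvidia_runtime(runtimes_json: str) -> bool:
    """Return True if the runtimes map contains an entry named "nvidia"."""
    try:
        runtimes = json.loads(runtimes_json)
    except ValueError:
        return False
    return isinstance(runtimes, dict) and "nvidia" in runtimes


def pick_local_instance_type(runtimes_json: str) -> str:
    """Choose 'local_gpu' only when the nvidia runtime is registered."""
    return "local_gpu" if has_nvidia_runtime(runtimes_json) else "local"
```

For example, passing `train_instance_type=pick_local_instance_type(docker_runtimes_json())` to the estimator would fall back to CPU local mode when the runtime is missing. That only sidesteps the error; training on the GPU still requires the `nvidia` runtime to be installed and registered with Docker.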