Unable to run a TensorFlow Estimator.

Open · nectario opened this issue 4 years ago · 3 comments

I am unable to run a TensorFlow estimator in local mode. I get the error shown after my code below.

Please advise!

Thank you,

Nektarios

My code:

from sagemaker.tensorflow import  TensorFlow

estimator = TensorFlow(entry_point="Testing.py",
                        role='SageMakerRole',
                        train_instance_count=1,
                        train_instance_type='local_gpu',
                        framework_version='2.2.0',
                        py_version='py37')
estimator.fit()

My Testing.py:

from pathlib import Path

import pandas as pd

import numpy as np

from tensorflow.keras import Input
from tensorflow.keras.layers import LSTM, Bidirectional, Dropout, Dense, AdditiveAttention, Concatenate, TimeDistributed, Permute
from tensorflow.keras.models import Model
from tensorflow.keras.metrics import RootMeanSquaredError, BinaryAccuracy
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.callbacks import Callback, ModelCheckpoint
import glob
import shutil
import os
from tqdm import tqdm

from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from utils.common import next_week_day, merge_files


print("Hi")

The error output:

"C:\Program Files\Python37\python.exe" C:/Development/Projects/DeepTradingAI/deeptradingmodels/DeepTradingEstimator.py
C:\Program Files\Python37\lib\site-packages\sagemaker\local\local_session.py:429: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  self.config = yaml.load(open(sagemaker_config_file, "r"))
Windows Support for Local Mode is Experimental
Legacy mode is deprecated in versions 1.13 and higher. Using script mode instead. Legacy mode and its training parameters will be deprecated in SageMaker Python SDK v2. Please use TF 1.13 or higher and script mode.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
WARNING! Using --password via the CLI is insecure. Use --password-stdin.

Creating tmpvy3d0546_algo-1-sk9mf_1 ... error
ERROR: for tmpvy3d0546_algo-1-sk9mf_1  Cannot create container for service algo-1-sk9mf: Unknown runtime specified nvidia

ERROR: for algo-1-sk9mf  Cannot create container for service algo-1-sk9mf: Unknown runtime specified nvidia
Encountered errors while bringing up the project.
Failed to delete: C:\Users\NEKTAR~1\AppData\Local\Temp\tmpvy3d0546\algo-1-sk9mf
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 161, in train
    _stream_output(process)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 677, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 166, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', 'C:\\Users\\NEKTAR~1\\AppData\\Local\\Temp\\tmpvy3d0546\\docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Development/Projects/DeepTradingAI/deeptradingmodels/DeepTradingEstimator.py", line 11, in <module>
    estimator.fit()
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\tensorflow\estimator.py", line 483, in fit
    fit_super()
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\tensorflow\estimator.py", line 462, in fit_super
    super(TensorFlow, self).fit(inputs, wait, logs, job_name, experiment_config)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\estimator.py", line 494, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\estimator.py", line 1066, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\session.py", line 590, in train
    self.sagemaker_client.create_training_job(**train_request)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\local_session.py", line 102, in create_training_job
    training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\entities.py", line 96, in start
    input_data_config, output_data_config, hyperparameters, job_name
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 174, in train
    self._cleanup(dirs_to_delete)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 597, in _cleanup
    _delete_tree(container_config_path)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 723, in _delete_tree
    shutil.rmtree(path)
  File "C:\Program Files\Python37\lib\shutil.py", line 516, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Program Files\Python37\lib\shutil.py", line 395, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "C:\Program Files\Python37\lib\shutil.py", line 404, in _rmtree_unsafe
    onerror(os.rmdir, path, sys.exc_info())
  File "C:\Program Files\Python37\lib\shutil.py", line 402, in _rmtree_unsafe
    os.rmdir(path)
OSError: [WinError 145] The directory is not empty: 'C:\\Users\\NEKTAR~1\\AppData\\Local\\Temp\\tmpvy3d0546\\algo-1-sk9mf\\output'

nectario, Jul 21 '20 13:07

@nectario What is your system setup? Are you on a GPU instance?

laurenyu, Jul 21 '20 21:07

Yes, I am using a GPU on my local machine. It's the Nvidia Titan Volta. Python version: 3.7. TensorFlow version: 2.2.0.

nectario, Jul 21 '20 21:07

ERROR: for algo-1-sk9mf  Cannot create container for service algo-1-sk9mf: Unknown runtime specified nvidia

I've usually seen this error when running a GPU image on a CPU instance, but since you're on a GPU, I don't have any immediate ideas. Do you have nvidia-docker2 installed? Can you try restarting the Docker daemon (as per this GitHub issue with a similar error message)? Are you able to run other GPU Docker images?

laurenyu, Jul 21 '20 23:07
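
One way to answer the "is the nvidia runtime installed?" question from Python is to ask the Docker daemon which runtimes it has registered. The sketch below is only an illustration, not part of the SageMaker SDK: the helper names (has_nvidia_runtime, docker_runtimes_json, pick_local_instance_type) are hypothetical, and it assumes the docker CLI is on PATH and that `docker info --format "{{json .Runtimes}}"` prints a JSON map of runtime names. If "nvidia" is missing from that map, `local_gpu` will likely fail with the "Unknown runtime specified nvidia" error above, and falling back to `local` (CPU) may be a workable stopgap.

```python
import json
import subprocess


def has_nvidia_runtime(runtimes_json: str) -> bool:
    """Return True if the JSON map of Docker runtimes has an "nvidia" entry."""
    # json.loads("null") yields None, so fall back to an empty map.
    runtimes = json.loads(runtimes_json) or {}
    return "nvidia" in runtimes


def docker_runtimes_json() -> str:
    """Ask the local Docker daemon for its registered runtimes as JSON.

    Assumes the docker CLI is installed and the daemon is running;
    raises CalledProcessError otherwise.
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def pick_local_instance_type(runtimes_json: str) -> str:
    """Choose the SageMaker local-mode instance type based on the runtimes."""
    return "local_gpu" if has_nvidia_runtime(runtimes_json) else "local"


# Example usage (requires a running Docker daemon):
#   instance_type = pick_local_instance_type(docker_runtimes_json())
#   # then pass instance_type as train_instance_type to the estimator
```

The parsing is kept separate from the subprocess call so the decision logic can be checked without Docker being present.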