sagemaker-pytorch-training-toolkit
[FATAL tini (7)] exec train failed: No such file or directory
BUG Description I'm trying to automate and scale a large collection of experiments using AWS SageMaker via the Python SDK. However, I am hitting an error that gives no indication of how to resolve it.
To reproduce
from sagemaker.pytorch import PyTorch

role = "arn:..."
# hparams is a dict of hyperparameters defined elsewhere in the script
estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="ml...xlarge",
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams
)
estimator.fit()
Expected behavior The training job should start, train the model, and log metrics and losses.
Screenshots or logs
[2023-09-10 23:08:59,329][sagemaker][INFO] - Creating training-job with name: xmtc-2023-09-11-02-08-56-094
2023-09-11 02:09:00 Starting - Starting the training job...
2023-09-11 02:09:18 Starting - Preparing the instances for training......
2023-09-11 02:10:27 Downloading - Downloading input data
2023-09-11 02:10:27 Training - Downloading the training image..................
2023-09-11 02:13:33 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: No such file or directory
2023-09-11 02:14:15 Uploading - Uploading generated training model
2023-09-11 02:14:15 Failed - Training job failed
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/LightningPrototype/run_on_sagemaker.py", line 32, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job xmtc-2023-09-11-02-08-56-094: Failed. Reason: AlgorithmError: , exit code: 127
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Process finished with exit code 1
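Note on the exit code: 127 is the conventional "command not found" status, which is consistent with the tini message that `train` could not be executed. The failure reason can also be read back from the job description after the fact. The following is a minimal sketch, not part of the original script; the helper name `get_failure_reason` is hypothetical, and it assumes a boto3 SageMaker client with working credentials and region.

```python
def get_failure_reason(sm_client, job_name):
    """Return (status, failure_reason) for a SageMaker training job.

    sm_client is assumed to be a boto3 SageMaker client, e.g.
    boto3.client("sagemaker"). FailureReason is only present on failed
    jobs, so .get() is used to return None otherwise.
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return desc["TrainingJobStatus"], desc.get("FailureReason")


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("sagemaker")
# print(get_failure_reason(client, "xmtc-2023-09-11-02-08-56-094"))
```

For this job the reason string is the same one shown in the exception above, so the call adds no new detail here; it is mainly useful when the local process has already exited.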
System information
- SageMaker Python SDK version: sagemaker 2.177.1
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch 2.0.1
- Python version: Python 3.10
- Custom Docker image (Y/N): Yes, on ECR.
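
Because the image is custom, one thing worth checking is whether it contains a `train` executable at all. SageMaker starts a training container by running the image with the argument `train`, so the tini error suggests that no program of that name is on the container's PATH (as far as I can tell, installing the `sagemaker-training` package in the image is one way to provide it). A minimal sketch of a check that could be run inside the container; the helper name `find_train_executable` is hypothetical:

```python
import shutil


def find_train_executable(path=None):
    """Return the full path of an executable named 'train', or None.

    path is an optional PATH-style string to search; when it is None,
    shutil.which falls back to the PATH environment variable.
    """
    return shutil.which("train", path=path)


# Inside the container, a result of None would match the
# "exec train failed: No such file or directory" message.
print(find_train_executable())
```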