sagemaker-pytorch-training-toolkit

[FATAL tini (7)] exec train failed: No such file or directory

Open celsofranssa opened this issue 8 months ago • 0 comments

BUG Description

I'm trying to automate and scale a large collection of experiments on AWS SageMaker via the Python SDK. However, the training job fails with an error that gives no indication of how to resolve it.

To reproduce

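    from sagemaker.pytorch import PyTorch

    role = "arn:..."

    # hparams (the hyperparameters) is defined elsewhere and not shown here.
    estimator = PyTorch(
        image_uri="1...ecr...amazonaws.com/...:prototype",  # custom image on ECR
        entry_point="main.py",
        role=role,
        region="us-...",
        instance_type="ml...xlarge",
        instance_count=1,
        volume_size=225,
        hyperparameters=hparams
    )
    estimator.fit()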

Expected behavior

The training job should start, train the model, and log metrics and losses.

Screenshots or logs

[2023-09-10 23:08:59,329][sagemaker][INFO] - Creating training-job with name: xmtc-2023-09-11-02-08-56-094
2023-09-11 02:09:00 Starting - Starting the training job...
2023-09-11 02:09:18 Starting - Preparing the instances for training......
2023-09-11 02:10:27 Downloading - Downloading input data
2023-09-11 02:10:27 Training - Downloading the training image..................
2023-09-11 02:13:33 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: No such file or directory

2023-09-11 02:14:15 Uploading - Uploading generated training model
2023-09-11 02:14:15 Failed - Training job failed
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/LightningPrototype/run_on_sagemaker.py", line 32, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job xmtc-2023-09-11-02-08-56-094: Failed. Reason: AlgorithmError: , exit code: 127

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Process finished with exit code 1
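
A possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: SageMaker starts a training container by running the image with `train` as its argument, so `exec train failed: No such file or directory` together with exit code 127 (the conventional "command not found" status) may mean that the custom image has no executable named `train` on its `PATH`. A minimal sketch of a check that could be run with the Python interpreter inside the image is shown here. The function names `find_train_executable` and `describe_train_lookup` are hypothetical helpers written for this report, not part of any SageMaker library.

```python
import shutil


def find_train_executable(name="train", path=None):
    """Return the full path of an executable called `name`, or None.

    `path` is an optional search path string; when it is None,
    shutil.which falls back to the PATH environment variable.
    """
    return shutil.which(name, path=path)


def describe_train_lookup(name="train", path=None):
    """Build a one-line, human-readable summary of the lookup result."""
    location = find_train_executable(name, path)
    if location is None:
        return f"no executable named '{name}' found on the search path"
    return f"'{name}' resolves to {location}"


# Harmless at import time: only prints what the lookup found.
print(describe_train_lookup())
```

If the lookup reports that nothing was found, that would be consistent with the tini error. To my knowledge the `sagemaker-training` package installs a `train` script, so checking whether it is installed in the image would be a reasonable next step, but I have not verified that this is the cause here.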

System information

  • SageMaker Python SDK version: sagemaker 2.177.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch 2.0.1
  • Python version: Python 3.10
  • Custom Docker image (Y/N): Yes, on ECR.

celsofranssa, Oct 13 '23 16:10