sagemaker-pytorch-training-toolkit
"Train": executable file not found in $PATH
Bug description When migrating my training job to run on SageMaker, I get an error that gives no indication of how to resolve it.
The same code runs without errors on my local machine.
To reproduce
from sagemaker.pytorch import PyTorch

role = "arn:..."
# hparams is a dict of hyperparameters defined elsewhere in the script.
estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="local",  # ml.g4dn.2xlarge
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams,
)
estimator.fit()
Expected behavior The model should start training and log metrics and losses.
Screenshots or logs
Cloning into '/tmp/tmpycpzvkcn'...
remote: Enumerating objects: 246, done.
remote: Counting objects: 100% (246/246), done.
remote: Compressing objects: 100% (190/190), done.
remote: Total 246 (delta 40), reused 232 (delta 29), pack-reused 0
Receiving objects: 100% (246/246), 39.10 MiB | 27.69 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Branch 'sagemaker' set up to track remote branch 'sagemaker' from 'origin'.
Switched to a new branch 'sagemaker'
[2023-10-12 19:22:15,073][sagemaker][INFO] - Creating training-job with name: xmtc-2023-10-13-02-22-09-781
[2023-10-12 19:22:15,116][sagemaker.local.image][INFO] - 'Docker Compose' found using Docker CLI.
[2023-10-12 19:22:15,117][sagemaker.local.local_session][INFO] - Starting training job
[2023-10-12 19:22:15,118][sagemaker.local.image][INFO] - Using the long-lived AWS credentials found in session
[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker compose file:
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-55row:
    command: train
    container_name: 1l7x1nzly6-algo-1-55row
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 179395270822.dkr.ecr.us-east-2.amazonaws.com/xmtc:prototype
    networks:
      sagemaker-local:
        aliases:
        - algo-1-55row
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpsvd2b_wm/algo-1-55row/output/data:/opt/ml/output/data
    - /tmp/tmpsvd2b_wm/algo-1-55row/input:/opt/ml/input
    - /tmp/tmpsvd2b_wm/algo-1-55row/output:/opt/ml/output
    - /tmp/tmpsvd2b_wm/model:/opt/ml/model
version: '2.3'
[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker command: docker compose -f /tmp/tmpsvd2b_wm/docker-compose.yaml up --build --abort-on-container-exit
time="2023-10-12T19:22:15-07:00" level=warning msg="a network with name sagemaker-local exists but was not created for project \"tmpsvd2b_wm\".\nSet `external: true` to use an existing network"
Container 1l7x1nzly6-algo-1-55row Creating
Container 1l7x1nzly6-algo-1-55row Created
Attaching to 1l7x1nzly6-algo-1-55row
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "train": executable file not found in $PATH: unknown
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 296, in train
    _stream_output(process)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 984, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run_on_sagemaker.py", line 28, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 1311, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 2374, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 941, in train
    self._intercept_create_request(train_request, submit, self.train.__name__)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 5618, in _intercept_create_request
    return create(request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 939, in submit
    self.sagemaker_client.create_training_job(**request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/local_session.py", line 203, in create_training_job
    training_job.start(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/entities.py", line 243, in start
    self.model_artifacts = self.container.train(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 301, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker', 'compose', '-f', '/tmp/tmpsvd2b_wm/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
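Reading the log above: the generated compose file starts the container with `command: train`, and the Docker daemon then reports that no executable named `train` exists on `$PATH` inside the image. A minimal diagnostic sketch, assuming Python is available in the custom image, is to run something like the following inside the container and see whether `train` resolves. The helper name `resolve_command` is hypothetical and only for illustration. One possible cause, which is an assumption about this image and not confirmed here, is that nothing installed in it provides a `train` script (the `sagemaker-training` package is one package that installs one).

```python
import os
import shutil


def resolve_command(name, search_path=None):
    """Return the full path of `name` if it is an executable on the search path, else None.

    `search_path` defaults to the PATH environment variable when None.
    """
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    # "train" is the command taken from the compose file in the log above.
    location = resolve_command("train")
    if location is None:
        print("'train' is not on $PATH:", os.environ.get("PATH", ""))
    else:
        print("'train' resolves to", location)
```

If this reports that `train` is missing, the image itself needs to provide that executable; the estimator arguments in the reproduction script do not affect it.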
System information
- SageMaker Python SDK version: sagemaker 2.192.0
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch 2.0.1
- Python version: Python 3.10
- Docker: 24.0.6
- Custom Docker image (Y/N): Yes, on ECR.