sagemaker-pytorch-training-toolkit

"Train": executable file not found in $PATH

Open celsofranssa opened this issue 8 months ago • 0 comments

Bug description

When migrating my training job to run on SageMaker, I get an error that gives no indication of how to resolve it.

The code runs perfectly on the local machine.

To reproduce

from sagemaker.pytorch import PyTorch

role = "arn:..."

# hparams: dict of hyperparameters (definition not shown)
estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",  # custom image on ECR
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="local",  # ml.g4dn.2xlarge
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams,
)
estimator.fit()

Expected behavior

The model should start training and log metrics and losses.

Screenshots or logs

Cloning into '/tmp/tmpycpzvkcn'...
remote: Enumerating objects: 246, done.
remote: Counting objects: 100% (246/246), done.
remote: Compressing objects: 100% (190/190), done.
remote: Total 246 (delta 40), reused 232 (delta 29), pack-reused 0
Receiving objects: 100% (246/246), 39.10 MiB | 27.69 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Branch 'sagemaker' set up to track remote branch 'sagemaker' from 'origin'.
Switched to a new branch 'sagemaker'
[2023-10-12 19:22:15,073][sagemaker][INFO] - Creating training-job with name: xmtc-2023-10-13-02-22-09-781
[2023-10-12 19:22:15,116][sagemaker.local.image][INFO] - 'Docker Compose' found using Docker CLI.
[2023-10-12 19:22:15,117][sagemaker.local.local_session][INFO] - Starting training job
[2023-10-12 19:22:15,118][sagemaker.local.image][INFO] - Using the long-lived AWS credentials found in session
[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-55row:
    command: train
    container_name: 1l7x1nzly6-algo-1-55row
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 179395270822.dkr.ecr.us-east-2.amazonaws.com/xmtc:prototype
    networks:
      sagemaker-local:
        aliases:
        - algo-1-55row
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpsvd2b_wm/algo-1-55row/output/data:/opt/ml/output/data
    - /tmp/tmpsvd2b_wm/algo-1-55row/input:/opt/ml/input
    - /tmp/tmpsvd2b_wm/algo-1-55row/output:/opt/ml/output
    - /tmp/tmpsvd2b_wm/model:/opt/ml/model
version: '2.3'

[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker command: docker compose -f /tmp/tmpsvd2b_wm/docker-compose.yaml up --build --abort-on-container-exit
time="2023-10-12T19:22:15-07:00" level=warning msg="a network with name sagemaker-local exists but was not created for project \"tmpsvd2b_wm\".\nSet `external: true` to use an existing network"
 Container 1l7x1nzly6-algo-1-55row  Creating
 Container 1l7x1nzly6-algo-1-55row  Created
Attaching to 1l7x1nzly6-algo-1-55row
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "train": executable file not found in $PATH: unknown
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 296, in train
    _stream_output(process)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 984, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_on_sagemaker.py", line 28, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 1311, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 2374, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 941, in train
    self._intercept_create_request(train_request, submit, self.train.__name__)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 5618, in _intercept_create_request
    return create(request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 939, in submit
    self.sagemaker_client.create_training_job(**request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/local_session.py", line 203, in create_training_job
    training_job.start(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/entities.py", line 243, in start
    self.model_artifacts = self.container.train(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 301, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker', 'compose', '-f', '/tmp/tmpsvd2b_wm/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
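
The log above shows Docker Compose starting the container with `command: train`, and the runtime then failing because no `train` executable is found on `$PATH` inside the image. One possible cause, not confirmed for this image, is that the custom image does not provide a `train` executable (for example, the console script installed by the `sagemaker-training` package). The following is a minimal diagnostic sketch, assuming Python is available inside the image; `find_container_command` is a hypothetical helper name, not part of the SageMaker SDK.

```python
import shutil
from typing import Optional


def find_container_command(command: str = "train", path: Optional[str] = None) -> Optional[str]:
    """Return the full path of `command` if it is an executable on the search path.

    `path` defaults to the current $PATH; pass an explicit string to check
    a different search path. Returns None when the command cannot be found,
    which is the situation the OCI runtime reports in the log above.
    """
    return shutil.which(command, path=path)


if __name__ == "__main__":
    resolved = find_container_command("train")
    if resolved is None:
        print('"train" was not found on $PATH in this environment')
    else:
        print(f'"train" resolves to {resolved}')
```

Run inside the image, this reports whether `train` resolves on `$PATH`. It only checks for the executable and does not validate the rest of the container setup.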


System information

  • SageMaker Python SDK version: sagemaker 2.192.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch 2.0.1
  • Python version: Python 3.10
  • Docker: 24.0.6
  • Custom Docker image (Y/N): Yes, on ECR.

celsofranssa, Oct 13 '23 16:10