
PyTorch 1.6 training image for SageMaker DDP not available in ECR

Open dficenec-aws opened this issue 4 years ago • 1 comments

Describe the bug
The framework image URI returned by sagemaker.image_uris.retrieve for PyTorch 1.6 training with SageMaker DDP (smdistributed dataparallel) enabled points to an image that does not exist in ECR.

To reproduce
In a SageMaker notebook instance (kernel conda_pytorch_p36):

import sagemaker

region = "us-east-1"
version = "1.6"
distribution = {'smdistributed':{'dataparallel':{'enabled': True}}}

sm_ddp_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    version=version,
    py_version="py3",
    instance_type="ml.p3.16xlarge",
    image_scope="training",
    distribution=distribution,
)

print(sm_ddp_uri)

763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6-gpu-py3-cu110-ubuntu18.04-v3

Then, in the SageMaker notebook instance terminal:

sh-4.2$ export REGION=us-east-1
sh-4.2$ aws ecr get-login-password --region $REGION |   docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
sh-4.2$ docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6-gpu-py3-cu110-ubuntu18.04-v3
Error response from daemon: manifest for 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6-gpu-py3-cu110-ubuntu18.04-v3 not found: manifest unknown: Requested image not found
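
The same check can be scripted without Docker. The following is a minimal sketch, not part of the original report: it splits the URI into registry account, region, repository and tag, then asks ECR for that tag with boto3's `batch_get_image`. The helper names `parse_ecr_uri` and `image_exists` are hypothetical, the regular expression only covers standard `amazonaws.com` registries, and it assumes the caller's credentials are allowed to call `ecr:BatchGetImage` on the registry.

```python
import re
from typing import NamedTuple


class EcrImageRef(NamedTuple):
    """Components of a tagged ECR image URI (hypothetical helper type)."""
    account_id: str
    region: str
    repository: str
    tag: str


# Matches <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repo>[^:@]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri: str) -> EcrImageRef:
    """Split a tagged ECR image URI into its parts; raise ValueError otherwise."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a tagged ECR image URI: {uri!r}")
    return EcrImageRef(
        match["account"], match["region"], match["repo"], match["tag"]
    )


def image_exists(uri: str, ecr_client=None) -> bool:
    """Return True if ECR reports an image for the URI's repository and tag.

    `ecr_client` is injectable for testing; by default a boto3 ECR client
    is created for the region embedded in the URI.
    """
    ref = parse_ecr_uri(uri)
    if ecr_client is None:
        import boto3  # imported lazily so parsing works without boto3

        ecr_client = boto3.client("ecr", region_name=ref.region)
    response = ecr_client.batch_get_image(
        registryId=ref.account_id,
        repositoryName=ref.repository,
        imageIds=[{"imageTag": ref.tag}],
    )
    # A missing tag is reported under "failures" and leaves "images" empty.
    return len(response.get("images", [])) > 0
```

Calling `image_exists(sm_ddp_uri)` on the URI printed above would be expected to return False while the tag is missing, matching the `manifest unknown` error from `docker pull`.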

Expected behavior
The image URI returned should point to an image that exists in ECR. The image URI returned when the distribution parameter is None does point to an available image.

System information

  • SageMaker Python SDK version: 2.29.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 1.6
  • Python version: 3.6 (SM notebook instance kernel conda_pytorch_p36)
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

dficenec-aws, Mar 14 '21 18:03

Hey @dficenec-aws - can you help confirm whether this issue still exists with the latest sagemaker SDK?

akrishna1995, Dec 28 '23 23:12

Closing this ticket since it is pending verification. Feel free to contact us if the issue persists.

zhaoqizqwang, May 06 '24 16:05