sagemaker-python-sdk
PyTorch 1.6 training image for SageMaker DDP not available in ECR
Describe the bug The training image URI returned by sagemaker.image_uris.retrieve for PyTorch 1.6 with SageMaker DDP (smdistributed dataparallel) enabled points to an image that does not exist in ECR.
To reproduce In a SageMaker notebook instance (kernel conda_pytorch_p36):
import sagemaker

region = "us-east-1"
version = "1.6"
# Enable the SageMaker distributed data parallel (smdistributed) library
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
sm_ddp_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    version=version,
    py_version="py3",
    instance_type="ml.p3.16xlarge",
    image_scope="training",
    distribution=distribution,
)
print(sm_ddp_uri)
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6-gpu-py3-cu110-ubuntu18.04-v3
In the SageMaker notebook instance terminal:
sh-4.2$ export REGION=us-east-1
sh-4.2$ aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
sh-4.2$ docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6-gpu-py3-cu110-ubuntu18.04-v3
Error response from daemon: manifest for 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6-gpu-py3-cu110-ubuntu18.04-v3 not found: manifest unknown: Requested image not found
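The same check can be made without Docker by asking ECR directly whether the tag exists. The following is a minimal sketch, not part of the original report: `parse_ecr_uri` and `image_exists` are hypothetical helper names, and `image_exists` assumes boto3 is installed and that the caller's credentials are allowed to call `DescribeImages` on the 763104351884 registry, which may not be the case.

```python
import re
from typing import NamedTuple


class EcrImageRef(NamedTuple):
    account: str
    region: str
    repository: str
    tag: str


# Matches <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
# (assumption: standard commercial-partition hostnames only).
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri: str) -> EcrImageRef:
    """Split a tagged ECR image URI into account, region, repository, tag."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a tagged ECR image URI: {uri!r}")
    return EcrImageRef(**match.groupdict())


def image_exists(uri: str) -> bool:
    """Return True if ECR reports the tag; needs boto3 and AWS credentials."""
    # Imported lazily so the parser above works without boto3 installed.
    import boto3
    from botocore.exceptions import ClientError

    ref = parse_ecr_uri(uri)
    ecr = boto3.client("ecr", region_name=ref.region)
    try:
        ecr.describe_images(
            registryId=ref.account,
            repositoryName=ref.repository,
            imageIds=[{"imageTag": ref.tag}],
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ImageNotFoundException":
            return False
        raise  # access denied, throttling, etc. are not "missing image"
    return True
```

Calling `image_exists(sm_ddp_uri)` on the URI printed above would be expected to return False if the manifest is indeed missing; other errors, such as access denied, are re-raised rather than treated as a missing image.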
Expected behavior The returned URI should point to an image that exists in ECR. The URI returned when the distribution parameter is None does point to an available image.
System information
- SageMaker Python SDK version: 2.29.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 1.6
- Python version: 3.6 (SM notebook instance kernel conda_pytorch_p36)
- CPU or GPU: CPU
- Custom Docker image (Y/N): Y
Hey @dficenec-aws - can you help confirm whether this issue still exists with the latest sagemaker?
Closing this ticket since it is pending verification. Feel free to contact us if the issue persists.