amazon-sagemaker-examples
[Content Improvement] Update instance types in pytorch_smdataparallel_mnist_demo
Link to the notebook pytorch_smdataparallel_mnist_demo
What aspects of the notebook can be improved?
This notebook is not working anymore with ml.p3dn.24xlarge instances: botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Unsupported instance type ml.p3dn.24xlarge
What are your suggestions?
Don't suggest ml.p3dn.24xlarge as a recommended instance
I was able to successfully run this notebook using both the suggested ml.p3dn.24xlarge and ml.p4d.24xlarge instances in us-west-2. Which region are you using? Did you make any other code changes?
I am in the eu-central-1 (Frankfurt) region. I haven't changed the code.
I have the same problem in the same region as you with the following notebook: https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/efficientnet/pytorch_smdataparallel_efficientnet_demo.ipynb
If I use the other recommended instance, the result is no better: "UnexpectedStatusException: Error for Training job pt-smddp-efficientnet-b0-2p4d: Failed. Reason: ClientError: Requested instances are not available in these availability zones: [eu-central-1a]. Please try again with subnets having sufficient address space from a different AZ."
Interestingly, in the meantime, a badge saying "skipped" has been added to the notebook for all instances: https://github.com/aws/amazon-sagemaker-examples/blob/22b8203af35d91a1cbeb9a4d3c9c781ac74b24d6/training/distributed_training/pytorch/data_parallel/efficientnet/pytorch_smdataparallel_efficientnet_demo.ipynb?short_path=fce2eeb#L541
Apparently, you have to store the data in a file system on a different subnet, but you have to guess which of the three to choose.
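One way to narrow down the guess might be to ask EC2 which Availability Zones offer the underlying instance type before picking a subnet. This is only a rough sketch: it assumes that EC2 offerings are a usable proxy for SageMaker `ml.*` availability (which is not guaranteed), and an offering says nothing about free capacity at the moment the job starts. The helper names `ec2_type` and `zones_offering` are made up for illustration; `describe_instance_type_offerings` is the boto3 EC2 client call.

```python
def ec2_type(sagemaker_instance_type):
    """Strip the SageMaker 'ml.' prefix, e.g. 'ml.p4d.24xlarge' -> 'p4d.24xlarge'."""
    prefix = "ml."
    if sagemaker_instance_type.startswith(prefix):
        return sagemaker_instance_type[len(prefix):]
    return sagemaker_instance_type


def zones_offering(ec2_client, sagemaker_instance_type):
    """Return the sorted Availability Zones in which EC2 offers the instance type.

    Assumption: EC2 offerings approximate SageMaker availability. Treat an
    empty result as a strong hint, not as proof either way.
    """
    request = {
        "LocationType": "availability-zone",
        "Filters": [
            {"Name": "instance-type", "Values": [ec2_type(sagemaker_instance_type)]}
        ],
    }
    zones = set()
    while True:
        response = ec2_client.describe_instance_type_offerings(**request)
        for offering in response.get("InstanceTypeOfferings", []):
            zones.add(offering["Location"])
        token = response.get("NextToken")
        if not token:
            break
        # Follow pagination until the service stops returning a token.
        request["NextToken"] = token
    return sorted(zones)


# Usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   client = boto3.client("ec2", region_name="eu-central-1")
#   for name in ("ml.p3dn.24xlarge", "ml.p4d.24xlarge"):
#       print(name, zones_offering(client, name))
```

If the list comes back empty for a region, the instance type is probably not offered there at all. If it lists some zones, a subnet in one of those zones would be the first candidate to try.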