
[Content Improvement] Update instance types in pytorch_smdataparallel_mnist_demo

Open enric1994 opened this issue 1 year ago • 2 comments

Link to the notebook pytorch_smdataparallel_mnist_demo

What aspects of the notebook can be improved? This notebook no longer works with ml.p3dn.24xlarge instances. It fails with: botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Unsupported instance type ml.p3dn.24xlarge

What are your suggestions? Don't list ml.p3dn.24xlarge as a recommended instance type.

enric1994, Jul 05 '22 11:07

I was able to successfully run this notebook using both the suggested ml.p3dn.24xlarge and ml.p4d.24xlarge instances in us-west-2. Which region are you using? Did you make any other code changes?

jkroll-aws, Jul 05 '22 16:07

I am in the eu-central-1 (Frankfurt) region. I haven't changed the code.

enric1994, Jul 06 '22 08:07

I have the same problem in the same region as you with the following notebook: https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/efficientnet/pytorch_smdataparallel_efficientnet_demo.ipynb

saskra, Apr 26 '23 06:04

If I use the other recommended instance, the result is no better: "UnexpectedStatusException: Error for Training job pt-smddp-efficientnet-b0-2p4d: Failed. Reason: ClientError: Requested instances are not available in these availability zones: [eu-central-1a]. Please try again with subnets having sufficient address space from a different AZ."

Interestingly, in the meantime, a badge saying "skipped" has been added to the notebook for all instances: https://github.com/aws/amazon-sagemaker-examples/blob/22b8203af35d91a1cbeb9a4d3c9c781ac74b24d6/training/distributed_training/pytorch/data_parallel/efficientnet/pytorch_smdataparallel_efficientnet_demo.ipynb?short_path=fce2eeb#L541

saskra, Apr 27 '23 10:04

Apparently, you have to store the data in a file system on a different subnet. But you have to guess which of the three subnets to choose.

saskra, May 08 '23 12:05
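
On the question of which Availability Zone to pick: one rough way to narrow down the guess is to ask EC2 which zones in the region offer the underlying instance type. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The helper names `to_ec2_type`, `zones_offering`, and `zones_in_region` are hypothetical and not part of the notebook. Note the limits of this check: EC2 listing a type in a zone does not guarantee that SageMaker supports the matching `ml.` type there, or that capacity is available when the training job starts.

```python
def to_ec2_type(sagemaker_instance_type):
    """Strip the SageMaker 'ml.' prefix, e.g. 'ml.p4d.24xlarge' -> 'p4d.24xlarge'."""
    prefix = "ml."
    if sagemaker_instance_type.startswith(prefix):
        return sagemaker_instance_type[len(prefix):]
    return sagemaker_instance_type


def zones_offering(ec2_client, sagemaker_instance_type):
    """Return the sorted Availability Zones where EC2 offers the instance type.

    ec2_client is any object exposing describe_instance_type_offerings,
    normally boto3.client("ec2", region_name=...).
    """
    kwargs = {
        "LocationType": "availability-zone",
        "Filters": [
            {"Name": "instance-type", "Values": [to_ec2_type(sagemaker_instance_type)]}
        ],
    }
    zones = set()
    while True:
        page = ec2_client.describe_instance_type_offerings(**kwargs)
        for offering in page.get("InstanceTypeOfferings", []):
            zones.add(offering["Location"])
        token = page.get("NextToken")
        if not token:
            break
        # Follow pagination until EC2 stops returning a token.
        kwargs["NextToken"] = token
    return sorted(zones)


def zones_in_region(region_name, sagemaker_instance_type):
    """Convenience wrapper that builds a real EC2 client (needs credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("ec2", region_name=region_name)
    return zones_offering(client, sagemaker_instance_type)
```

Calling `zones_in_region("eu-central-1", "ml.p4d.24xlarge")` would then list candidate zones. If the list does not include the zone of the subnet the job is using (eu-central-1a in the error above), a subnet in one of the listed zones is the more promising choice. An empty list suggests EC2 does not offer the type in that region at all.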