amazon-sagemaker-examples

Added examples for Distributed Data Parallel (DDP/SMDDP) training with PyTorch Lightning on SageMaker.

Open 0x6b64 opened this issue 2 years ago • 31 comments

AWS SageMaker now supports PyTorch training (single-node and distributed) using Lightning (https://pytorch-lightning.readthedocs.io/en/stable/). A link to the announcement blog post will be added to this description once it has been published. This change adds examples of single-node and multi-node training with Lightning, with a particular focus on the SMDDP backend (https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html).

Issue #, if available: None. The changes here are new, self-contained examples that do not interfere with the existing examples.

Description of changes:

The examples added demonstrate the following scenarios.

  1. An IPython notebook that demonstrates single-node training on different accelerators (CPU/GPU).
  2. Six distributed training jobs, executed via a launcher script, that demonstrate multi-node training with the DDP and SMDDP backends for the MNIST and BERT models. Examples are provided for both the Strategy architecture used in the latest Lightning and the Plugin architecture used in older Lightning (1.5.10), which customers still use. DDPPlugin: https://github.com/Lightning-AI/lightning/blob/1.5.10/pytorch_lightning/plugins/training_type/ddp.py#L78 ; DDPStrategy: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/strategies/ddp.py#L79
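
For reviewers unfamiliar with the Plugin/Strategy split, the following is a minimal sketch of how a launcher could choose between the two class paths linked above and enable SMDDP on the SageMaker side. It is not the code added in this PR: the helper names (`parse_version`, `ddp_class_path`, `estimator_kwargs`) are hypothetical, and the 1.6 cutoff reflects my understanding that `DDPStrategy` replaced `DDPPlugin` in Lightning 1.6. The `distribution` dictionary is the documented way to enable the SageMaker data parallel library on a `sagemaker.pytorch.PyTorch` estimator.

```python
from typing import Any, Dict, Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '1.5.10'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def ddp_class_path(lightning_version: str) -> str:
    """Pick the DDP class path for a given Lightning version.

    Assumption: releases before 1.6 expose DDPPlugin and later releases
    expose DDPStrategy, matching the two URLs in the description.
    """
    if parse_version(lightning_version) < (1, 6):
        return "pytorch_lightning.plugins.training_type.ddp.DDPPlugin"
    return "pytorch_lightning.strategies.ddp.DDPStrategy"


def estimator_kwargs(backend: str, instance_count: int) -> Dict[str, Any]:
    """Build keyword arguments for a SageMaker PyTorch estimator (hypothetical helper).

    Only the SMDDP backend sets `distribution` here; how the native DDP
    jobs in this PR were launched is not stated, so that case adds nothing.
    """
    if backend not in ("ddp", "smddp"):
        raise ValueError(f"unknown backend: {backend!r}")
    kwargs: Dict[str, Any] = {"instance_count": instance_count}
    if backend == "smddp":
        kwargs["distribution"] = {"smdistributed": {"dataparallel": {"enabled": True}}}
    return kwargs


if __name__ == "__main__":
    print(ddp_class_path("1.5.10"))
    print(ddp_class_path("1.7.0"))
    print(estimator_kwargs("smddp", instance_count=2))
```

The returned dictionary would be unpacked into the estimator constructor alongside the entry point, role, and instance type, none of which are shown here.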

Testing done:

In this change, seven examples have been added. All of them were tested on the 570106654206 AWS account. The following are references to successful job executions. In addition, the linter was run to validate the style of the Python notebook.

  1. ddp plugin mnist: https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/lightning-ddp-plugin-2022-08-18-11-19-30-380
  2. ddp strategy bert: https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/lightning-ddp-strategy-bert-2022-08-18-14-37-31-390
  3. ddp strategy mnist: https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/lightning-ddp-strategy-mnist-2022-08-18-14-50-28-999
  4. smddp plugin mnist: https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/lightning-smddp-plugin-mnist-2022-08-18-15-12-27-201
  5. smddp strategy bert: https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/lightning-smddp-strategy-bert-2022-08-18-17-14-27-434
  6. smddp strategy mnist: https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/lightning-smddp-strategy-mnist-2022-08-18-15-32-52-541
  7. single node mnist: Validated correctness in https://dhimank-dev.notebook.us-west-2.sagemaker.aws/notebooks/lightning/pytorch-lightning-mnist-single-node.ipynb

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • [x] I have read the CONTRIBUTING doc and adhered to the example notebook best practices
  • [x] I have updated any necessary documentation, including READMEs
  • [x] I have tested my notebook(s) and ensured it runs end-to-end
  • [x] I have linted my notebook(s) and code using tox -e black-format,black-nb-format

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

0x6b64 avatar Aug 02 '22 14:08 0x6b64

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: d74966e91a3dedc8864ba3c39e775f674cc58803
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot avatar Aug 02 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: d74966e91a3dedc8864ba3c39e775f674cc58803
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 02 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: d74966e91a3dedc8864ba3c39e775f674cc58803
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 02 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: d74966e91a3dedc8864ba3c39e775f674cc58803
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 02 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: cc922b45fa372695ae1a76ea79cc0cf11c6d30e6
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: cc922b45fa372695ae1a76ea79cc0cf11c6d30e6
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: cc922b45fa372695ae1a76ea79cc0cf11c6d30e6
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: cc922b45fa372695ae1a76ea79cc0cf11c6d30e6
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: 50ce44604cded8d2553697ba1ca7d00547b813f2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: 50ce44604cded8d2553697ba1ca7d00547b813f2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: 50ce44604cded8d2553697ba1ca7d00547b813f2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: 50ce44604cded8d2553697ba1ca7d00547b813f2
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: d965052e190638ce41f5edfc9032a16b7cfeb207
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: d965052e190638ce41f5edfc9032a16b7cfeb207
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: d965052e190638ce41f5edfc9032a16b7cfeb207
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 15:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: d965052e190638ce41f5edfc9032a16b7cfeb207
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 16:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: 757e8d0cb3f3bb6bc69b0325e3d8483de02593a5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 17:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: 757e8d0cb3f3bb6bc69b0325e3d8483de02593a5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 17:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: 757e8d0cb3f3bb6bc69b0325e3d8483de02593a5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 17:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: 757e8d0cb3f3bb6bc69b0325e3d8483de02593a5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 18:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: 226345d23d9f821718f4b04a5b2ff6b5ba189128
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 18:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: 226345d23d9f821718f4b04a5b2ff6b5ba189128
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 18:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: 226345d23d9f821718f4b04a5b2ff6b5ba189128
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 18:08 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: 226345d23d9f821718f4b04a5b2ff6b5ba189128
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Aug 18 '22 19:08 sagemaker-bot