amazon-sagemaker-examples

Adding Heterogeneous Clusters example for TensorFlow and PyTorch

Open gilinachum opened this issue 3 years ago • 5 comments

Description of changes: Heterogeneous Clusters for Amazon SageMaker model training was announced in July 2022. We're providing extensive code examples for using the feature with TensorFlow and PyTorch.

Testing done: Ran the TensorFlow and PyTorch notebooks (2 in total).

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • [x] I have read the CONTRIBUTING doc and adhered to the example notebook best practices
  • [x] I have updated any necessary documentation, including READMEs
  • [x] I have tested my notebook(s) and ensured it runs end-to-end
  • [x] I have linted my notebook(s) and code using black-nb -l 100 {path}/{notebook-name}.ipynb

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

gilinachum avatar Sep 13 '22 21:09 gilinachum

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: 06da2ce645af25160daba9a97feba38aff1e074e
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot avatar Sep 13 '22 21:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: 06da2ce645af25160daba9a97feba38aff1e074e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 13 '22 21:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: 06da2ce645af25160daba9a97feba38aff1e074e
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 13 '22 21:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: 06da2ce645af25160daba9a97feba38aff1e074e
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 13 '22 22:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: bd479c691a6d8bf2184c8d3228073a14006a672d
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 00:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: bd479c691a6d8bf2184c8d3228073a14006a672d
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 00:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: bd479c691a6d8bf2184c8d3228073a14006a672d
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 00:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: bd479c691a6d8bf2184c8d3228073a14006a672d
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 01:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: fe7f8ce037cf78bd2c2fc926b953c5a2291b3ae3
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 11:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: fe7f8ce037cf78bd2c2fc926b953c5a2291b3ae3
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 11:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: fe7f8ce037cf78bd2c2fc926b953c5a2291b3ae3
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 11:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: fe7f8ce037cf78bd2c2fc926b953c5a2291b3ae3
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 12:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: 9004b214fe5a1ccea0c64b1e4419784ae1979594
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 17:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: 9004b214fe5a1ccea0c64b1e4419784ae1979594
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 17:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: 9004b214fe5a1ccea0c64b1e4419784ae1979594
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 17:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: 9004b214fe5a1ccea0c64b1e4419784ae1979594
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 24 '22 19:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: 38ef3f5d50cea2cabd3eee786952f359101a32e8
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 26 '22 10:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: 38ef3f5d50cea2cabd3eee786952f359101a32e8
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 26 '22 10:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: 38ef3f5d50cea2cabd3eee786952f359101a32e8
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 26 '22 10:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: d47542b19ee628ab4d0c4003952fcf44e01d062c
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 26 '22 10:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: d47542b19ee628ab4d0c4003952fcf44e01d062c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 26 '22 10:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: d47542b19ee628ab4d0c4003952fcf44e01d062c
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 26 '22 10:09 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: amazon-sagemaker-examples-pr
  • Commit ID: d47542b19ee628ab4d0c4003952fcf44e01d062c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Sep 26 '22 11:09 sagemaker-bot

Very extensive example! I'm concerned about testing it, though. Some of the example lives only in the README, which isn't testable. The region is also hardcoded to us-east-1, as are the instance types, so the notebooks will fail when we test in other regions. The bots have picked up a lot of security issues, and beyond those, if you're setting up network comms like this, you need a security section in the overview of the solution(s).

Given the complexity of the solution, I can't help but think: 1) could a lot of this be done in a container to reduce the noise and complexity for someone trying this out; 2) if it's untestable across regions, should it be moved to aws-samples, so that we simply link to it from the examples website (with the caveat that it goes untested)?

If you can work through the security issues and solve the problem for several/many/most commercial regions, then I think we could proceed. But I don't think locking it to us-east-1 is a great customer experience.
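To illustrate the region concern, here is a minimal sketch of resolving the region and instance type at run time instead of hardcoding them. It is not code from this PR. It assumes the region is exposed through the `AWS_REGION` or `AWS_DEFAULT_REGION` environment variables, and that the caller already knows which instance types the region offers; the helper names `resolve_region` and `pick_instance_type` are hypothetical, and the instance types in the usage note are only illustrative.

```python
import os


def resolve_region(env=None):
    """Return the AWS region from the environment rather than a hardcoded value.

    Assumption: the region is set in AWS_REGION or AWS_DEFAULT_REGION.
    """
    env = os.environ if env is None else env
    for name in ("AWS_REGION", "AWS_DEFAULT_REGION"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("no AWS region configured in the environment")


def pick_instance_type(preferred, available):
    """Return the first preferred instance type that is in `available`.

    `available` is supplied by the caller; how it is obtained for a given
    region is outside the scope of this sketch.
    """
    for instance_type in preferred:
        if instance_type in available:
            return instance_type
    raise ValueError(f"none of {list(preferred)} is available")
```

A notebook could then call `resolve_region()` once and pass an ordered preference list such as `["ml.g5.xlarge", "ml.p3.2xlarge"]` to `pick_instance_type`, so that a test run in another region fails with a clear message instead of a hardcoded mismatch.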

Thank you, @aaronmarkham.

  • Testing all code - Except for the root-level README, we won't have README files (we'll link directly to the notebooks). We'll also make sure that all of the runnable code runs as part of these notebooks. I'd like to keep the solution in the SM Examples repo, since it's the best, and currently the only, representation of the Heterogeneous Clusters feature of SM Training.
  • Security of communication channels - We'll add a section to the main README that explains the network connections between the instances, the port numbers, and their purpose. Unlike inference, which is exposed (directly or indirectly) to end users, this is a training workload, where all the software is controlled by the data scientist who runs the training. The solution communicates between training instances (communication on all ports between instances is allowed by default), just like other processes that take place during distributed training (MPI, DDP, Horovod, SSH, etc.). We don't add IAM rules, security groups, VPCs, etc.
  • Security of creating processes - These CodeGuru warnings are out of context: the added code acts as a proxy script and re-executes the data scientist's commands, the same commands that are executed for any training job. There's no end-user input, as there would be in a web app scenario. Normally: user --cmd--> [train.py]. With launcher.py acting as a proxy: user --cmd--> [launcher.py] --cmd+extra_arg--> [train.py].
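The proxy pattern described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the actual launcher.py from the PR. It assumes train.py is a Python script in the working directory, and `--launched-by-proxy` is a hypothetical stand-in for the extra argument, which the thread does not name.

```python
import subprocess
import sys


def build_command(train_args, extra_args):
    """Build the argv for train.py: the original arguments plus the extras."""
    return [sys.executable, "train.py", *train_args, *extra_args]


def main(argv):
    # "--launched-by-proxy" is a hypothetical flag; the thread does not say
    # which extra argument the real launcher appends.
    command = build_command(argv, ["--launched-by-proxy", "1"])
    # Passing a list with the default shell=False means no shell parses the
    # arguments, so they reach train.py exactly as given.
    return subprocess.run(command, check=False).returncode
```

A launcher script would call `sys.exit(main(sys.argv[1:]))` at its entry point. Because the command is passed as an argument list rather than a shell string, the forwarded arguments are not reinterpreted on the way to train.py, which is the property process-creation warnings usually ask about.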

gilinachum avatar Sep 28 '22 12:09 gilinachum

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: 3415bd9c07f86b48bbad0fb6d03989ee046d15c8
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Oct 04 '22 14:10 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: 3415bd9c07f86b48bbad0fb6d03989ee046d15c8
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Oct 04 '22 14:10 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-grammar
  • Commit ID: 3415bd9c07f86b48bbad0fb6d03989ee046d15c8
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Oct 04 '22 14:10 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-code-formatting
  • Commit ID: 0490d31916805e6d4f3e54d129a7a3f2d7391dc9
  • Result: FAILED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Oct 04 '22 14:10 sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-examples-link-check
  • Commit ID: 0490d31916805e6d4f3e54d129a7a3f2d7391dc9
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

sagemaker-bot avatar Oct 04 '22 14:10 sagemaker-bot