amazon-sagemaker-examples
Adding Heterogeneous Clusters example for TensorFlow and PyTorch
Description of changes: Heterogeneous Clusters for Amazon SageMaker model training was announced in July 2022. We're providing extensive code examples for using the feature with TensorFlow and PyTorch.
Testing done: Ran the TensorFlow and PyTorch notebooks (2 in total).
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
- [x] I have read the CONTRIBUTING doc and adhered to the example notebook best practices
- [x] I have updated any necessary documentation, including READMEs
- [x] I have tested my notebook(s) and ensured it runs end-to-end
- [x] I have linted my notebook(s) and code using `black-nb -l 100 {path}/{notebook-name}.ipynb`
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-code-formatting
- Commit ID: 06da2ce645af25160daba9a97feba38aff1e074e
- Result: FAILED
- Build Logs (available for 30 days)
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-link-check
- Commit ID: 06da2ce645af25160daba9a97feba38aff1e074e
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-grammar
- Commit ID: 06da2ce645af25160daba9a97feba38aff1e074e
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: amazon-sagemaker-examples-pr
- Commit ID: 06da2ce645af25160daba9a97feba38aff1e074e
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-code-formatting
- Commit ID: bd479c691a6d8bf2184c8d3228073a14006a672d
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-link-check
- Commit ID: bd479c691a6d8bf2184c8d3228073a14006a672d
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-grammar
- Commit ID: bd479c691a6d8bf2184c8d3228073a14006a672d
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: amazon-sagemaker-examples-pr
- Commit ID: bd479c691a6d8bf2184c8d3228073a14006a672d
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-link-check
- Commit ID: fe7f8ce037cf78bd2c2fc926b953c5a2291b3ae3
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-code-formatting
- Commit ID: fe7f8ce037cf78bd2c2fc926b953c5a2291b3ae3
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-grammar
- Commit ID: fe7f8ce037cf78bd2c2fc926b953c5a2291b3ae3
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: amazon-sagemaker-examples-pr
- Commit ID: fe7f8ce037cf78bd2c2fc926b953c5a2291b3ae3
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-link-check
- Commit ID: 9004b214fe5a1ccea0c64b1e4419784ae1979594
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-code-formatting
- Commit ID: 9004b214fe5a1ccea0c64b1e4419784ae1979594
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-grammar
- Commit ID: 9004b214fe5a1ccea0c64b1e4419784ae1979594
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: amazon-sagemaker-examples-pr
- Commit ID: 9004b214fe5a1ccea0c64b1e4419784ae1979594
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-link-check
- Commit ID: 38ef3f5d50cea2cabd3eee786952f359101a32e8
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-code-formatting
- Commit ID: 38ef3f5d50cea2cabd3eee786952f359101a32e8
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-grammar
- Commit ID: 38ef3f5d50cea2cabd3eee786952f359101a32e8
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-code-formatting
- Commit ID: d47542b19ee628ab4d0c4003952fcf44e01d062c
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-link-check
- Commit ID: d47542b19ee628ab4d0c4003952fcf44e01d062c
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-grammar
- Commit ID: d47542b19ee628ab4d0c4003952fcf44e01d062c
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: amazon-sagemaker-examples-pr
- Commit ID: d47542b19ee628ab4d0c4003952fcf44e01d062c
- Result: SUCCEEDED
- Build Logs (available for 30 days)
Very extensive example! I'm concerned about testing it, though. Some of the example lives only in the README, which isn't testable. Also, the region is hardcoded to us-east-1 and the instance types are hardcoded too, which will fail when we test in other regions. There are a lot of security issues that the bots have picked up, and I'd also say that if you're setting up network comms like this, you need a security section in the overview of the solution(s). Given the complexity of the solution, I can't help but think: 1) could a lot of this be done in a container to reduce the amount of noise and complexity for someone trying this out; 2) if it's untestable across regions, should it be moved to aws-samples, so that we can simply link to it from the examples website (with the caveat that it goes untested)? If you can work through the security issues and solve the problem for several/many/most commercial regions, then I think we could proceed. But locking it to us-east-1 isn't a great customer experience, in my view.
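To illustrate the region concern, here is a minimal sketch of resolving the region at run time instead of hardcoding us-east-1. The helper name `resolve_region` and the lookup order are assumptions made for illustration and are not taken from the PR. `AWS_REGION` and `AWS_DEFAULT_REGION` are the standard AWS SDK environment variables. In a notebook, `sagemaker.Session().boto_region_name` from the SageMaker Python SDK would be an alternative source for the same value.

```python
import os


def resolve_region(env=None, default=None):
    """Return the AWS region from the environment rather than a hardcoded value.

    Hypothetical helper: the lookup order below is an assumption, not the PR's code.
    """
    env = os.environ if env is None else env
    # Check the standard AWS SDK region variables in turn.
    for var in ("AWS_REGION", "AWS_DEFAULT_REGION"):
        value = env.get(var)
        if value:
            return value
    # Fall back to an explicit default only if the caller supplied one.
    if default is not None:
        return default
    raise RuntimeError("No AWS region configured; set AWS_REGION or AWS_DEFAULT_REGION")
```

A notebook could then call `resolve_region()` once and pass the result wherever a region string is needed, so the same cells run unchanged in any region the test harness selects.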
Thank you, @aaronmarkham.
- Testing all code - Except for the root-level README, we won't have README files (we'll link directly to notebooks). We'll also make sure all of the runnable code runs as part of these notebooks. I'd like to keep the solution in the SM Examples repo since it's the best, and currently the only, representation of the heterogeneous clusters feature of SM Training.
- Security communication channels - We'll add a section to the main README that explains the network connections between the instances, the port numbers, and their purpose. Unlike inference, which is exposed (directly or indirectly) to end users, this is a training workload, where all the software is controlled by the data scientist who runs the training. The solution communicates between training instances (communication on all ports between instances is allowed by default), just like other processes that take place during distributed training (MPI, DDP, Horovod, SSH, etc.). We don't add IAM rules, security groups, VPCs, etc.
- Security creating processes - These CodeGuru warnings are out of context: the added code acts as a proxy script that re-executes the data scientist's commands, the same commands that are executed for any training job. There's no end-user input as in a web app scenario.
  Normally: user --cmd--> [train.py]
  With launcher.py acting as a proxy: user --cmd--> [launcher.py] --cmd+extra_arg--> [train.py]
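The proxy pattern described in the last item can be sketched roughly as follows. This is not the PR's launcher.py: the function names and the `--extra_arg` flag are hypothetical and only show the shape of the idea. The command is passed to `subprocess.run` as an argument list, so no shell is involved and the forwarded arguments are not reinterpreted.

```python
import subprocess
import sys


def build_command(user_args, extra_args, python=sys.executable):
    """Compose the command for the wrapped script.

    user_args: the arguments the job was started with (script path first).
    extra_args: arguments the proxy appends (hypothetical, for illustration).
    """
    # An argument list (not a single string) means no shell parsing happens.
    return [python, *user_args, *extra_args]


def launch(user_args, extra_args):
    """Re-execute the user's command with the extra arguments appended."""
    cmd = build_command(user_args, extra_args)
    # Propagate the child's exit code so a failure is still reported as a failure.
    return subprocess.run(cmd, check=False).returncode
```

For example, `launch(["train.py", "--epochs", "1"], ["--extra_arg", "value"])` would run `train.py` with the original arguments followed by the appended one and return its exit code.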
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-link-check
- Commit ID: 3415bd9c07f86b48bbad0fb6d03989ee046d15c8
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-code-formatting
- Commit ID: 3415bd9c07f86b48bbad0fb6d03989ee046d15c8
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-grammar
- Commit ID: 3415bd9c07f86b48bbad0fb6d03989ee046d15c8
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-code-formatting
- Commit ID: 0490d31916805e6d4f3e54d129a7a3f2d7391dc9
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-examples-link-check
- Commit ID: 0490d31916805e6d4f3e54d129a7a3f2d7391dc9
- Result: SUCCEEDED
- Build Logs (available for 30 days)