
P5 on AWS Batch w/ Capacity Blocks

Open sean-smith opened this issue 1 year ago • 0 comments

  1. First, create a VPC and subnet in the same Availability Zone (AZ) as your Capacity Block. You can use the following template:


 1-Click Deploy 🚀 
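If you are not sure which AZ your Capacity Block is in, you can look it up before creating the VPC. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the right account and region. The helper name capacity_block_az is hypothetical, and cr-1234567890 is a placeholder ID.

```shell
# Hypothetical helper: print the Availability Zone of a Capacity Block.
# $1 is the capacity reservation ID (cr-1234567890 is a placeholder).
capacity_block_az() {
  aws ec2 describe-capacity-reservations \
    --capacity-reservation-ids "$1" \
    --query 'CapacityReservations[0].AvailabilityZone' \
    --output text
}

# Example call (replace the ID with your own):
# capacity_block_az cr-1234567890
```

Create the subnet in the AZ this prints.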

  1. Next, deploy the AWS Batch template included in this PR, where cr-1234567890 is a placeholder for the ID of your Capacity Block and aws-batch-vpc is the name of the VPC stack you created above.
aws cloudformation create-stack --stack-name aws-batch-p5 \
                                --template-body file://0.aws-batch-distributed-training-p5.yaml \
                                --parameters ParameterKey=VPCStackParameter,ParameterValue="aws-batch-vpc" \
                                             ParameterKey=CapacityBlockId,ParameterValue="cr-1234567890" \
                                --capabilities CAPABILITY_NAMED_IAM
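The create-stack call returns as soon as creation starts, so it can help to wait for the stack to finish before submitting jobs. A minimal sketch, assuming the stack name aws-batch-p5 from the command above. The helper name wait_for_stack is hypothetical.

```shell
# Hypothetical helper: block until stack creation finishes, then print the stack status.
# $1 is the stack name (aws-batch-p5 in the command above).
wait_for_stack() {
  # The waiter polls CloudFormation and exits non-zero if creation fails or times out.
  aws cloudformation wait stack-create-complete --stack-name "$1" &&
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Example call:
# wait_for_stack aws-batch-p5
```

If the waiter fails, the status line is not printed. In that case, the stack events in the CloudFormation console usually show which resource could not be created.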
  1. Next, submit a job in the AWS Batch console. By default, the NCCLTest job definition uses a pre-built container image:
public.ecr.aws/hpc-cloud/nccl-tests:latest
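The job can also be submitted from the command line instead of the console. This is a sketch under assumptions: the helper name submit_nccl_test is hypothetical, the job queue name is a placeholder, and the job definition name is assumed to be NCCLTest. Check the resources of the deployed stack for the actual names.

```shell
# Hypothetical helper: submit a Batch job and print its job ID.
# $1 = job name, $2 = job queue name, $3 = job definition name.
submit_nccl_test() {
  aws batch submit-job \
    --job-name "$1" \
    --job-queue "$2" \
    --job-definition "$3" \
    --query 'jobId' \
    --output text
}

# Example call (my-p5-job-queue is a placeholder, NCCLTest is an assumed name):
# submit_nccl_test nccl-test-1 my-p5-job-queue NCCLTest
```

The printed job ID can be passed to aws batch describe-jobs --jobs to follow the job's status.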

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sean-smith, Oct 04 '24 21:10