awsome-distributed-training
P5 on AWS Batch w/ Capacity Blocks
- First, create a VPC and subnet in the same Availability Zone (AZ) as your capacity block. You can use the following template:
- Next, deploy the AWS Batch template included in this PR, where the `CapacityBlockId` value (for example `cr-053d6fb40dcbb54f2`) is the ID of your capacity block and `aws-batch-vpc` is the name of the VPC stack you created above.
aws cloudformation create-stack --stack-name aws-batch-p5 \
--template-body file://0.aws-batch-distributed-training-p5.yaml \
--parameters ParameterKey=VPCStackParameter,ParameterValue="aws-batch-vpc" \
ParameterKey=CapacityBlockId,ParameterValue="cr-1234567890" \
--capabilities CAPABILITY_NAMED_IAM
- Next, submit a job in the AWS Batch console. By default, the NCCLTest job definition uses a pre-built container image:
public.ecr.aws/hpc-cloud/nccl-tests:latest
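If you prefer the CLI to the console, the steps above could be scripted roughly as follows. This is a minimal sketch, not part of the PR: the helper function names and the job name `nccl-test` are made up for illustration, and the job queue and job definition names are placeholders that you would need to look up in the resources created by the `aws-batch-p5` stack.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the job name "nccl-test" are hypothetical.
# The job queue and job definition names must come from your own stack.

# Block until CloudFormation reports the stack as created, then print its status.
wait_for_stack() {
  local stack_name="$1"
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" &&
    aws cloudformation describe-stacks --stack-name "$stack_name" \
      --query 'Stacks[0].StackStatus' --output text
}

# Submit a job to AWS Batch and print the job ID that the service returns.
submit_nccl_test() {
  local job_queue="$1" job_definition="$2"
  aws batch submit-job \
    --job-name nccl-test \
    --job-queue "$job_queue" \
    --job-definition "$job_definition" \
    --query 'jobId' --output text
}

# Print the current status of a job (for example RUNNABLE, RUNNING, SUCCEEDED, FAILED).
job_status() {
  aws batch describe-jobs --jobs "$1" --query 'jobs[0].status' --output text
}

# Example usage (nothing below runs automatically):
#   wait_for_stack aws-batch-p5
#   job_id=$(submit_nccl_test <your-job-queue> <your-job-definition>)
#   job_status "$job_id"
```

Wrapping the calls in functions keeps the script safe to source, since nothing touches your account until you call a helper explicitly.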
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.