awsome-distributed-training
Fix NCCL tests on EKS
Issue #, if available:
Description of changes:
- Fixed the image pull policy configuration to resolve the following error during deployment:
Error from server (BadRequest): error when creating "nccl-tests.yaml": MPIJob in version "v2beta1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.mpiReplicaSpecs.Launcher.template.spec.imagePullPolicy", unknown field "spec.mpiReplicaSpecs.Worker.template.spec.imagePullPolicy" yaml file apiVersion:
- Removed unnecessary EFA flags; the OFI plugin now sets these automatically.
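
For context on the first fix: in Kubernetes, `imagePullPolicy` is a field of each entry under `containers`, not of the pod `spec` itself, which is why strict decoding rejects it at `template.spec.imagePullPolicy`. The fragment below is a minimal sketch of the corrected placement, not the actual `nccl-tests.yaml` from this PR. The job name, container names, image, replica counts, and the `IfNotPresent` value are all placeholders chosen for illustration.

```yaml
# Illustrative sketch only; names, image, and replica counts are hypothetical.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-tests-example            # hypothetical job name
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          # imagePullPolicy must NOT be set here (pod spec level).
          containers:
            - name: launcher          # hypothetical container name
              image: example.com/nccl-tests:latest   # placeholder image
              imagePullPolicy: IfNotPresent          # container-level field
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker            # hypothetical container name
              image: example.com/nccl-tests:latest   # placeholder image
              imagePullPolicy: IfNotPresent          # container-level field
```

With the field moved under each container, the API server's strict decoding should no longer report `imagePullPolicy` as an unknown field for the Launcher or Worker templates.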
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.