
fix nccl test eks

Open roywei opened this issue 1 year ago • 0 comments

Issue #, if available:

Description of changes:

  1. Fixed the image pull policy config to resolve this error during deployment:
     Error from server (BadRequest): error when creating "nccl-tests.yaml": MPIJob in version "v2beta1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.mpiReplicaSpecs.Launcher.template.spec.imagePullPolicy", unknown field "spec.mpiReplicaSpecs.Worker.template.spec.imagePullPolicy"
     yaml file apiVersion:
  2. Removed unnecessary flags for EFA; these are now set automatically by the OFI plugin.
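
For reviewers, here is a rough sketch of where the field is accepted. In Kubernetes, `imagePullPolicy` is a container-level field, so strict decoding rejects it directly under `template.spec`. The job name, container names, image placeholder, replica counts, and the `IfNotPresent` value are assumptions for illustration and are not taken from the actual `nccl-tests.yaml`:

```yaml
# Minimal sketch only; names, image, replicas, and policy value are placeholders.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-tests-example          # hypothetical name
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          # Placing imagePullPolicy at this level (pod spec) triggers the
          # "unknown field" strict decoding error quoted above.
          containers:
            - name: launcher        # placeholder
              image: "<nccl-tests-image>"   # placeholder
              imagePullPolicy: IfNotPresent # valid here, on the container
    Worker:
      replicas: 2                   # placeholder
      template:
        spec:
          containers:
            - name: worker          # placeholder
              image: "<nccl-tests-image>"   # placeholder
              imagePullPolicy: IfNotPresent # valid here, on the container
```

The same placement applies to both the Launcher and Worker replica specs, since the error reports the field as unknown in each.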

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

roywei, Oct 09 '24 03:10