mhuguesaws
*Description of changes:*
- EFA Installer 1.35.0 (waiting for release)
- aws-ofi-nccl 1.12.0

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms...
In their current form, various files are not tied to a specific orchestrator. This issue proposes organizing them per orchestrator:
- kubernetes/train.yaml
- slurm/train.sbatch
The Dockerfile in FSDP does not pin a specific version. It is best practice to specify versions rather than use `latest`. https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/Dockerfile#L2
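One way to catch this in review is a small check that flags `FROM` lines with no tag or with `:latest`. The sketch below is illustrative only: the function name `unpinned_base_images` and the image names in the usage example are hypothetical and are not taken from the repository's Dockerfile.

```python
def unpinned_base_images(dockerfile_text):
    """Return FROM image references that have no tag/digest or use ':latest'.

    Limitations of this sketch: a FROM that refers to an earlier build stage
    by name, or that uses an ARG variable, is flagged as unpinned.
    """
    flagged = []
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip optional flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        if image.lower() == "scratch" or "@" in image:
            continue  # 'scratch' has no version; '@sha256:...' is pinned by digest
        # A tag, if present, follows a ':' in the last path segment
        # (a ':' earlier in the reference is a registry port).
        last_segment = image.rsplit("/", 1)[-1]
        if ":" not in last_segment or last_segment.rsplit(":", 1)[1] == "latest":
            flagged.append(image)
    return flagged


if __name__ == "__main__":
    # Hypothetical Dockerfile content, for illustration only.
    sample = "FROM example.com/team/pytorch:latest\nFROM python:3.11-slim AS build\n"
    print(unpinned_base_images(sample))
```

A check like this could run in CI so that an unpinned base image is reported before merge.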
Every node that runs the install script modifies the global Slurm configuration, which is shared across nodes.
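One possible mitigation, sketched here under assumptions, is to guard the shared-configuration step so that only the first node to reach it applies the change. The helper `run_once` and the marker-file approach are hypothetical and are not part of the install script; the sketch also assumes the marker lives on a filesystem where exclusive file creation is atomic, which should be verified for the shared filesystem in use.

```python
import os


def run_once(marker_path, action):
    """Run action() only in the first caller that creates marker_path.

    Returns True if this caller ran the action, False otherwise.
    Caveats of this sketch: if action() raises, the marker is left behind,
    and other callers do not wait for the action to finish.
    """
    try:
        # O_CREAT | O_EXCL raises FileExistsError when the marker already
        # exists, so only one caller gets past this call.
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    action()
    return True
```

Each node would call `run_once` with the same marker path on shared storage and pass the configuration change as `action`; node-local install steps stay outside the guard.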