awsome-distributed-training

Remove AWS PyTorch channel in examples

Open junpuf opened this issue 1 year ago • 1 comments

Issue #, if available: #444

Description of changes: The AWS PyTorch Conda channel is being deprecated and its development will stop, so this change removes its usage from the examples.


Testing

Test infrastructure: two p4d instances in a pcluster cluster with FSx and Slurm, built from the DLAMI (the same AMI used by HyperPod).

10.FSDP

  • Passed test
  • The environment variables below are no longer required in newer aws-ofi-nccl versions:
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1
  • Fixed README.md
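
For clusters that still run an older aws-ofi-nccl, the two exports above could be kept behind an opt-in switch instead of being set unconditionally. A minimal sketch, assuming a hypothetical `USE_LEGACY_EFA_VARS` toggle that is not part of this PR:

```shell
# Sketch only: USE_LEGACY_EFA_VARS is a hypothetical toggle, not part of this PR.
# Set it to 1 on clusters whose aws-ofi-nccl still needs the explicit EFA settings.
set_legacy_efa_vars() {
  if [ "${USE_LEGACY_EFA_VARS:-0}" = "1" ]; then
    export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
    export FI_EFA_FORK_SAFE=1
    echo "legacy EFA variables exported"
  else
    # Default path: leave the environment untouched.
    echo "legacy EFA variables skipped"
  fi
}

set_legacy_efa_vars
```

Calling the function near the top of a job script keeps the default path free of the variables while leaving a documented fallback for older installations.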

16.pytorch-cpu-ddp

  • Passed test
  • Updated the conda environment creation process

17.SM-modelparallelv2

For 17.SM-modelparallelv2, the package pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0" appears to declare a dependency on aws-ofi-nccl >=1.7.1,<2.0 (probably because its build recipe was copied from the AWS conda channel). Because of this, I added a workaround that supplies the two binaries this pytorch package needs. Without them, the solver fails with:

Could not solve for environment specs
The following package could not be installed
└─ pytorch ==2.2.0 sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0 is not installable because it requires
   └─ aws-ofi-nccl >=1.7.1,<2.0 , which does not exist (perhaps a missing channel).

The workaround adds the two required binaries (details below) to a new bin directory inside the example directory.

# dependency package for pytorch ==2.2.0 sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0
aws-ofi-nccl-1.7.4-aws_0.tar.bz2 
# dependency package for aws-ofi-nccl-1.7.4-aws_0.tar.bz2 
hwloc-2.9.2-h2bc3f7f_0.tar.bz2
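
One way to consume the two files listed above is to install them from their local paths, dependency first. The sketch below is an assumption about usage, not the script from this PR: `BIN_DIR`, `ENV_NAME`, and the `DRY_RUN` switch are hypothetical names introduced for illustration. It prints the commands by default instead of running them.

```shell
# Sketch only: BIN_DIR, ENV_NAME and DRY_RUN are hypothetical, not from this PR.
BIN_DIR="${BIN_DIR:-./bin}"
ENV_NAME="${ENV_NAME:-smpv2}"

install_local_packages() {
  # hwloc comes first because aws-ofi-nccl depends on it.
  for pkg in hwloc-2.9.2-h2bc3f7f_0.tar.bz2 aws-ofi-nccl-1.7.4-aws_0.tar.bz2; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default: only show what would be executed.
      echo "conda install -y -n ${ENV_NAME} ${BIN_DIR}/${pkg}"
    else
      conda install -y -n "${ENV_NAME}" "${BIN_DIR}/${pkg}"
    fi
  done
}

install_local_packages
```

Installing a package from a local file skips conda's dependency resolution, which is why the order matters here. Run with `DRY_RUN=0` to execute the commands.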

20.FSDP-Mamba

Only removed commented-out lines that referenced the AWS PyTorch conda channel; no test necessary.
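
A quick search can confirm that no references to the channel remain in an example directory. This is a rough sketch: the helper name `find_channel_refs` is hypothetical, and the default pattern `aws-ml-conda` is an assumption about how the channel appears in scripts, so pass your own pattern if it differs.

```shell
# Sketch only: the default pattern is an assumption about how the channel
# appears in scripts; pass a different pattern as the second argument.
find_channel_refs() {
  dir="${1:-.}"
  pattern="${2:-aws-ml-conda}"
  # -r: recurse, -n: show line numbers, -F: treat the pattern as a fixed string
  grep -rnF -- "$pattern" "$dir" || echo "no references to '$pattern' under $dir"
}
```

Running `find_channel_refs .` from an example directory lists each matching file and line number, including commented-out lines.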


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

junpuf Oct 07 '24 21:10

@junpuf why is this marked as a draft? Can we mark it ready for review?

sean-smith Oct 08 '24 18:10