Remove AWS PyTorch channel in examples
Issue #, if available: #444
Description of changes: The AWS PyTorch Conda channel is being deprecated and future development will stop, so this PR removes its usage from the examples.
Testing
Test infra: 2 p4d instances on pcluster with FSx and Slurm, built from the DLAMI (the same AMI that HyperPod uses).
10.FSDP
- Passed test
- The environment variables below are no longer required with newer aws-ofi-nccl versions:
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1
- Fixed README.md
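One way to confirm that a launch script no longer relies on these two variables is to check whether they are still exported. The following is a minimal sketch, assuming a shell with `printenv` available; the function name `legacy_efa_vars_set` is hypothetical and not part of the example scripts.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the legacy EFA variables that are
# still exported in the current environment (one per line, nothing if none).
legacy_efa_vars_set() {
    for var in FI_EFA_USE_DEVICE_RDMA FI_EFA_FORK_SAFE; do
        # printenv exits non-zero when the variable is not in the environment
        if printenv "$var" >/dev/null 2>&1; then
            echo "$var"
        fi
    done
}

# Example: warn about each leftover variable and drop it before launching.
for var in $(legacy_efa_vars_set); do
    echo "unsetting $var (not needed with newer aws-ofi-nccl)" >&2
    unset "$var"
done
```

Whether to unset the variables or only warn is a judgment call; on clusters that still run an older aws-ofi-nccl, keeping them may be the safer choice.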
16.pytorch-cpu-ddp
- Passed test
- Updated conda env creation process
17.SM-modelparallelv2
For 17.SM-modelparallelv2, it seems that pytorch="2.2.0=sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0" declares a dependency on aws-ofi-nccl >=1.7.1,<2.0 (probably because the build recipe was copied from the AWS conda channel). Without that channel, the solve fails with the error below, so I added a workaround that supplies the 2 binaries this pytorch package needs.
Could not solve for environment specs
The following package could not be installed
└─ pytorch ==2.2.0 sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0 is not installable because it requires
└─ aws-ofi-nccl >=1.7.1,<2.0 , which does not exist (perhaps a missing channel).
The workaround adds the 2 required binaries (details below) to a new bin directory inside the example directory.
# dependency package for pytorch ==2.2.0 sm_py3.10_cuda12.1_cudnn8.9.5_nccl_pt_2.2_tsm_2.3_cuda12.1_0
aws-ofi-nccl-1.7.4-aws_0.tar.bz2
# dependency package for aws-ofi-nccl-1.7.4-aws_0.tar.bz2
hwloc-2.9.2-h2bc3f7f_0.tar.bz2
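The bundled packages can be installed from their local files before pytorch is installed. The sketch below is an illustration, not the script in this PR: the function name `install_bundled_pkgs` and the `BIN_DIR` variable are hypothetical, and it assumes the target conda env is already activated. Conda does not resolve dependencies when installing directly from a package file, which is why both files are passed explicitly.

```shell
#!/bin/sh
# Assumed location of the bundled packages (the new bin directory in the example).
BIN_DIR="${BIN_DIR:-./bin}"

# Hypothetical helper: verify both tarballs exist, then install them from file.
install_bundled_pkgs() {
    for pkg in hwloc-2.9.2-h2bc3f7f_0.tar.bz2 aws-ofi-nccl-1.7.4-aws_0.tar.bz2; do
        if [ ! -f "$BIN_DIR/$pkg" ]; then
            echo "missing $BIN_DIR/$pkg" >&2
            return 1
        fi
    done
    # Installing from a local file skips dependency resolution,
    # so hwloc is listed alongside aws-ofi-nccl.
    conda install -y \
        "$BIN_DIR/hwloc-2.9.2-h2bc3f7f_0.tar.bz2" \
        "$BIN_DIR/aws-ofi-nccl-1.7.4-aws_0.tar.bz2"
}

# Usage (inside the activated env, before installing pytorch):
#   install_bundled_pkgs
```

Once aws-ofi-nccl 1.7.4 is present in the env, the pytorch package's `>=1.7.1,<2.0` constraint should be satisfiable without the deprecated channel.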
20.FSDP-Mamba
Removed the commented-out lines that referenced the AWS PyTorch conda channel; no test necessary.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
@junpuf why is this marked as a draft? Can we mark it ready for review?