awsome-distributed-training
NCCL libfabric conflict caused by aws-ofi-nccl 1.9.0
If you installed aws-ofi-nccl 1.9.0 from conda on a system whose libfabric version is older than 1.18.2, you may see errors such as the following:
[0] NCCL INFO NET/Plugin : dlerror=/opt/amazon/efa/lib/libfabric.so.1: version `FABRIC_1.7' not found (required by /fsx/ubuntu/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.10/site-packages/torch/lib/../../../../libnccl-net.so) No plugin found (libnccl-net.so), using internal implementation
You can fix this by either upgrading to aws-ofi-nccl 1.9.1 or downgrading to aws-ofi-nccl 1.7.4. For example, to downgrade:
conda install aws-ofi-nccl=1.7.4 \
--override-channels \
-c https://aws-ml-conda.s3.us-west-2.amazonaws.com/ \
-c nvidia -c conda-forge
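Before changing packages, it may help to confirm that the system is actually in the affected range. The following is a minimal sketch, not an official check. It assumes that `fi_info` is installed under `/opt/amazon/efa/bin` (the same prefix as the library in the log above), that `fi_info --version` prints a line of the form `libfabric: <version>`, and that GNU `sort -V` is available. `version_lt`, `MIN_LIBFABRIC`, and `FI_INFO` are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash

# Succeeds (exit 0) when version $1 is strictly lower than version $2.
# Relies on GNU `sort -V` for version-aware ordering.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Threshold taken from the issue description.
MIN_LIBFABRIC="1.18.2"

# Assumption: fi_info lives next to the EFA libfabric install; override via FI_INFO.
FI_INFO="${FI_INFO:-/opt/amazon/efa/bin/fi_info}"

if [ -x "$FI_INFO" ]; then
  # Assumption: the output contains a line like "libfabric: 1.18.1amzn1.0".
  # The sed call strips any vendor suffix so only the numeric part is compared.
  installed="$("$FI_INFO" --version | awk '/^libfabric:/ {print $2}' | sed 's/[^0-9.].*//')"
  if [ -z "$installed" ]; then
    echo "Could not parse the libfabric version from $FI_INFO --version"
  elif version_lt "$installed" "$MIN_LIBFABRIC"; then
    echo "libfabric $installed is older than $MIN_LIBFABRIC: aws-ofi-nccl 1.9.0 may fail to load"
  else
    echo "libfabric $installed is at least $MIN_LIBFABRIC"
  fi
else
  echo "fi_info not found at $FI_INFO; set FI_INFO to its location"
fi

# Show which aws-ofi-nccl build is installed in the active conda environment.
if command -v conda >/dev/null 2>&1; then
  conda list aws-ofi-nccl
fi
```

If the script reports an older libfabric and `conda list` shows aws-ofi-nccl 1.9.0, the environment matches the conditions described in this issue.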
Fixed in https://github.com/aws-samples/awsome-distributed-training/pull/291
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.