
NCCL libfabric conflict caused by aws-ofi-nccl 1.9.0

sean-smith opened this issue 1 year ago • 1 comment

If you installed aws-ofi-nccl 1.9.0 from conda on a system whose libfabric is older than 1.18.2, NCCL may fail to load the plugin and fall back to its internal implementation, logging an error such as the following:

 [0] NCCL INFO NET/Plugin : dlerror=/opt/amazon/efa/lib/libfabric.so.1: version `FABRIC_1.7' not found (required by /fsx/ubuntu/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.10/site-packages/torch/lib/../../../../libnccl-net.so) No plugin found (libnccl-net.so), using internal implementation
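
To spot this failure in a job log, a minimal sketch is to search for the dlerror signature shown above. The function name `has_libfabric_conflict` and the log path in the usage comment are hypothetical, and the sketch assumes the job ran with `NCCL_DEBUG=INFO` so that the `NCCL INFO` lines were written.

```shell
#!/bin/sh
# Hypothetical helper (not part of NCCL or aws-ofi-nccl).
# Returns 0 if the log contains the "version `FABRIC_x.y' not found"
# dlerror emitted when the plugin cannot resolve a libfabric symbol version.
has_libfabric_conflict() {
  log_file="$1"
  # -q: no output, exit status only; -E: extended regex.
  # The two bare dots around FABRIC_x.y match the backtick and the quote
  # that the dynamic loader prints around the symbol version.
  grep -qE 'libfabric\.so\.[0-9]+: version .FABRIC_[0-9.]+. not found' "$log_file"
}

# Usage (the path is an assumption; point it at your own job log):
#   if has_libfabric_conflict ./slurm-1234.out; then
#     echo "aws-ofi-nccl / libfabric version conflict detected"
#   fi
```

A match means the plugin was not loaded, so the job is likely running on NCCL's internal network implementation.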

You can fix this by upgrading to aws-ofi-nccl 1.9.1 or by downgrading to aws-ofi-nccl 1.7.4. For example, to downgrade:

conda install aws-ofi-nccl=1.7.4 \
--override-channels \
-c https://aws-ml-conda.s3.us-west-2.amazonaws.com/ \
-c nvidia -c conda-forge
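
To decide whether a given installation falls in the affected range described in this issue (libfabric older than 1.18.2 combined with aws-ofi-nccl 1.9.0), the condition can be sketched as a small shell check. The helper names `version_lt` and `is_affected` are hypothetical. The sketch assumes GNU `sort -V` is available and that you pass plain dotted version strings, without vendor suffixes.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any AWS tooling).

# version_lt A B: succeeds if version A is strictly older than version B.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# is_affected LIBFABRIC_VERSION PLUGIN_VERSION
# Succeeds only for the combination reported in this issue:
# aws-ofi-nccl exactly 1.9.0 with libfabric older than 1.18.2.
is_affected() {
  [ "$2" = "1.9.0" ] && version_lt "$1" "1.18.2"
}

# Example with illustrative version numbers:
if is_affected "1.18.1" "1.9.0"; then
  echo "affected: change the aws-ofi-nccl version"
else
  echo "not affected"
fi
```

The installed versions can be read from `fi_info --version` and `conda list aws-ofi-nccl`. Their output may carry extra text or suffixes, so trim it to the dotted version before passing it in.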

Fixed in https://github.com/aws-samples/awsome-distributed-training/pull/291

sean-smith, May 01 '24 04:05

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot], Jul 31 '24 01:07

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Sep 29 '24 02:09