
NCCL Slowdown caused by aws-ofi-nccl conflict

sean-smith opened this issue 1 year ago · 2 comments

If you experience an NCCL slowdown, the first step is to enable debug logging:

export NCCL_DEBUG=INFO
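For a multi-node job the variable has to be set in the environment of every rank. A minimal sketch, assuming a Slurm + torchrun launch (train.py and the launcher flags are placeholders; adapt to your own job script):

export NCCL_DEBUG=INFO                # verbose NCCL init/network logging on every rank
export NCCL_DEBUG_SUBSYS=INIT,NET     # optional: limit output to the init and network subsystems
srun torchrun --nproc_per_node=8 train.py 2>&1 | tee nccl_debug.log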

This will surface any misconfigurations in the logs. For example, if you see:

version `EFA_1.2' not found (required by /opt/amazon/efa/lib/libfabric.so.1)
No plugin found (libnccl-net.so), using internal implementation
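Before changing anything, it can help to confirm where the mismatched library is coming from. A couple of diagnostic commands, assuming the standard EFA installer path /opt/amazon/efa and a conda environment (adjust paths as needed):

# List any copies of libfabric, the EFA provider, or the NCCL net plugin bundled inside the conda env
find "$CONDA_PREFIX" \( -name "libfabric*" -o -name "libefa*" -o -name "libnccl-net*" \) 2>/dev/null

# Check which shared libraries the system libfabric actually resolves against
ldd /opt/amazon/efa/lib/libfabric.so.1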

This likely means NCCL is pulling in a version of the aws-ofi-nccl plugin that wasn't compiled against the system libfabric. You can check this (assuming you're using conda) by running:

conda list | grep -E "nvidia|nccl|cud|torch"

If this shows something like:

nvidia-nccl-cu12           2.19.3                    pypi_0    pypi

It's likely this version is getting pulled in as a dependency and isn't working properly. You can override it and install aws-ofi-nccl from the Amazon PyTorch conda channel like so:

conda install -y \
    aws-ofi-nccl \
    --override-channels \
    -c https://aws-ml-conda.s3.us-west-2.amazonaws.com/ \
    -c nvidia -c conda-forge

The version `EFA_1.2' not found error should now disappear from the logs.
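To verify the fix, one option (the launch command and grep patterns are illustrative; exact log wording differs between aws-ofi-nccl versions) is to confirm the conda package is now the one providing the plugin and re-run with NCCL_DEBUG=INFO:

# Confirm aws-ofi-nccl now comes from the conda channel rather than pip
conda list | grep -E "aws-ofi-nccl|nccl"

# Re-run the workload and look for the OFI/libfabric lines in the NCCL log
NCCL_DEBUG=INFO torchrun --nproc_per_node=8 train.py 2>&1 | grep -Ei "NET/OFI|libfabric"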

sean-smith · Apr 25 '24

In practice, does this happen only for certain PyTorch builds?

Has it ever happened with nccl-tests?

verdimrc · Apr 26 '24

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] · Jul 26 '24

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Sep 24 '24