NCCL Slowdown caused by aws-ofi-nccl conflict
If you experience an NCCL slowdown, the first step is to enable NCCL debug logging:
export NCCL_DEBUG=INFO
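Below is a minimal sketch of how this can look for a torchrun launch; the script name, process count, and log file name are placeholders, not part of the original setup:

```bash
# Enable NCCL debug logging for a distributed PyTorch job (illustrative).
export NCCL_DEBUG=INFO
# Optionally restrict the output to initialization and network messages.
export NCCL_DEBUG_SUBSYS=INIT,NET
# train.py and the process count are placeholders for your own workload.
torchrun --nproc_per_node=8 train.py 2>&1 | tee nccl_debug.log
```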
This will allow you to catch any misconfigurations in the logs. For example, if you see:
version `EFA_1.2' not found (required by /opt/amazon/efa/lib/libfabric.so.1)
No plugin found (libnccl-net.so), using internal implementation
This likely means NCCL is loading a build of the aws-ofi-nccl plugin that was not compiled against the system libfabric. You can check this (assuming you're using conda) by running:
conda list | grep -E "nvidia|nccl|cud|torch"
If this shows something like:
nvidia-nccl-cu12 2.19.3 pypi_0 pypi
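As an additional, hedged check, you can look at which libnccl-net.so the loader can see and which libfabric it resolves against; the paths below are illustrative and depend on how and where the plugin was installed:

```bash
# Look for NCCL net plugins in the active conda env and in common install
# locations (paths are illustrative).
find "$CONDA_PREFIX" /opt/aws-ofi-nccl /usr/local -name 'libnccl-net*.so*' 2>/dev/null

# For a given hit, check which libfabric it links against; a consistent EFA
# setup should resolve to the system libfabric under /opt/amazon/efa/lib.
ldd "$CONDA_PREFIX/lib/libnccl-net.so" | grep -i fabric
```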
It's likely this pip-provided NCCL build is getting pulled in as a dependency and isn't working properly. You can override this and install aws-ofi-nccl from the AWS conda channel like so:
conda install -y \
aws-ofi-nccl \
--override-channels \
-c https://aws-ml-conda.s3.us-west-2.amazonaws.com/ \
-c nvidia -c conda-forge
The `version EFA_1.2 not found` error should now disappear from the logs.
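One way to confirm the fix end to end (a sketch, assuming nccl-tests is cloned and built on the node and an MPI launcher is configured) is to rerun a small all-reduce benchmark with debug logging enabled and check that the OFI plugin is picked up instead of the internal implementation:

```bash
# Sketch: verify the plugin after reinstalling. Assumes nccl-tests has been
# built locally and mpirun/EFA networking are already set up.
export NCCL_DEBUG=INFO
mpirun -np 8 ./nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 2>&1 | tee verify.log

# The log should no longer contain "No plugin found (libnccl-net.so)".
grep -i "plugin" verify.log
```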
In practice, does this happen only for certain PyTorch builds?
Has it ever happened with nccl-tests?