awsome-distributed-training
Libfabric Error with NCCL 2.19+
If you see the following warnings in the libfabric log output after setting FI_LOG_LEVEL=info:
libfabric:652244:1713524816::core:core:cuda_set_sync_memops():207<warn> Failed to perform cuPointerSetAttribute: CUDA_ERROR_NOT_SUPPORTED:operation not supported
libfabric:652244:1713524816::efa:mr:efa_mr_hmem_setup():254<warn> unable to set memops for cuda ptr
libfabric:652244:1713524816::efa:mr:efa_mr_regattr():1014<warn> Unable to register MR: Invalid argument
you can work around it by setting the following environment variable before launching the job:
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
This affects:
- EFA 1.26.0
- NCCL 2.19+
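As a minimal sketch of how the workaround might sit in a launch script, the snippet below exports the variable and prints the resulting settings so they can be checked in the job log. The placement at the top of the script and the choice of FI_LOG_LEVEL=warn are assumptions for illustration, not something stated in this issue; the training command itself is left out because it depends on your setup.

```shell
#!/usr/bin/env bash
# Workaround from this issue: stop the EFA provider from setting
# CUDA sync memops on registered buffers.
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

# Assumption: keep libfabric logging at "warn" so the messages quoted
# above would still be visible if the workaround did not take effect.
export FI_LOG_LEVEL=warn

# Print the effective settings so they appear in the job log.
env | grep -E '^FI_(EFA_SET_CUDA_SYNC_MEMOPS|LOG_LEVEL)='

# Launch the training job after this point (command depends on your setup).
```

The variables need to reach every rank, not only the shell that starts the job. Some launchers do not forward the caller's environment by default (with Open MPI's mpirun, for example, each variable is passed with -x), so it is worth confirming the printed values on a remote node.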