awsome-distributed-training

Libfabric Error with NCCL 2.19+

Open sean-smith opened this issue 1 year ago • 1 comments

If you see the following warnings in your job output after enabling libfabric logging with FI_LOG_LEVEL=info:

libfabric:652244:1713524816::core:core:cuda_set_sync_memops():207<warn> Failed to perform cuPointerSetAttribute: CUDA_ERROR_NOT_SUPPORTED:operation not supported
libfabric:652244:1713524816::efa:mr:efa_mr_hmem_setup():254<warn> unable to set memops for cuda ptr
libfabric:652244:1713524816::efa:mr:efa_mr_regattr():1014<warn> Unable to register MR: Invalid argument

you can resolve it by setting the following environment variable before launching the job:

export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
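As a rough sketch of how this could be wired into a launch script: the helper below scans a log from a previous run for the warning shown above and exports the workaround only when it is present. The function name `apply_memops_workaround` and the log path are hypothetical and not part of libfabric, NCCL, or this repository; only the variable `FI_EFA_SET_CUDA_SYNC_MEMOPS` comes from the issue itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of libfabric or NCCL): look for the
# "unable to set memops" warning in a previous run's log and, if it is
# there, export the workaround into the current shell.
apply_memops_workaround() {
  local log_file="$1"
  # -q: no output, only the exit status; -s: stay quiet if the file is missing
  if grep -qs 'unable to set memops for cuda ptr' "$log_file"; then
    export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
    echo "memops warning found; exported FI_EFA_SET_CUDA_SYNC_MEMOPS=0"
  else
    echo "memops warning not found; environment left unchanged"
  fi
}
```

Calling `apply_memops_workaround train.log` (with `train.log` standing in for wherever your job writes its output) before starting the training command would set the variable for that shell. On multi-node jobs the variable presumably needs to reach every rank, so check how your launcher propagates environment variables.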

This affects:

  • EFA 1.26.0
  • NCCL 2.19+

sean-smith, Apr 19 '24 16:04

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot], Jul 25 '24 01:07

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Sep 23 '24 01:09