When testing test_internode.py, what is the effect of setting NVSHMEM_DISABLE_P2P=1? Will NVLink be disabled?

Open MoringKing opened this issue 10 months ago • 5 comments

Feb 28 '25 09:02 MoringKing

After setting NVSHMEM_DISABLE_P2P=1, NVSHMEM can no longer be used for NVLink transfers. However, this is not an issue in our implementation, because our NVLink data transfers do not rely on the NVSHMEM API. Instead, we use CUDA PTX instructions directly for NVLink data transfers.

Feb 28 '25 10:02 sphish

Thank you for your reply. To confirm further: what is the purpose of setting NVSHMEM_DISABLE_P2P=1 in the low_latency scenario? And what is the impact of setting NVSHMEM_DISABLE_P2P=1 in internode scenarios?

Mar 01 '25 06:03 MoringKing

Leaving P2P enabled (NVSHMEM_DISABLE_P2P=0) also works, but we disable P2P (NVSHMEM_DISABLE_P2P=1) to ensure we are not using NVLink through NVSHMEM.

Mar 03 '25 10:03 sphish

But when I set NVSHMEM_DISABLE_P2P to 0 and run the low-latency test on two nodes (16 GPUs), I get the following error:

Mar 11 '25 12:03 GitAlice123

It appears that the program is failing during the bootstrap phase of NVSHMEM, which doesn't seem reasonable, but I'm not sure why this is happening.

Mar 17 '25 02:03 sphish
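
For anyone comparing the two settings discussed above, here is a minimal sketch of toggling NVSHMEM_DISABLE_P2P from Python. It assumes the variable must be in the environment of the process before NVSHMEM initializes, so it launches the command as a child process with an adjusted environment. The helper names `nvshmem_env` and `run_with_p2p_setting` are hypothetical and are not part of DeepEP or NVSHMEM, and the probe command is only a stand-in for a real test launch.

```python
import os
import subprocess
import sys


def nvshmem_env(disable_p2p: bool, base_env=None) -> dict:
    """Return a copy of the environment with NVSHMEM_DISABLE_P2P set.

    Hypothetical helper for illustration; not part of DeepEP or NVSHMEM.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NVSHMEM_DISABLE_P2P"] = "1" if disable_p2p else "0"
    return env


def run_with_p2p_setting(cmd, disable_p2p: bool) -> subprocess.CompletedProcess:
    """Run `cmd` in a child process whose environment carries the chosen setting."""
    return subprocess.run(
        cmd,
        env=nvshmem_env(disable_p2p),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Stand-in child command: it only echoes the variable it inherited.
    # Replace it with the actual test launch command for a real comparison.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['NVSHMEM_DISABLE_P2P'])"]
    for disable in (True, False):
        result = run_with_p2p_setting(probe, disable)
        print(f"disable_p2p={disable}: child sees {result.stdout.strip()}")
```

Building the environment as a copy keeps the parent process unchanged, which makes it easier to run both settings back to back and compare the logs.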