DeepEP

When testing test_internode.py, what is the effect of setting NVSHMEM_DISABLE_P2P=1? Will NVLink be disabled?

Open MoringKing opened this issue 10 months ago • 5 comments

Image

MoringKing avatar Feb 28 '25 09:02 MoringKing

After setting NVSHMEM_DISABLE_P2P=1, NVSHMEM can no longer be used for NVLink transfers. However, this is not an issue in our implementation, because our NVLink data transfers do not rely on the NVSHMEM API. Instead, we use CUDA PTX instructions directly for NVLink data transfers.

sphish avatar Feb 28 '25 10:02 sphish

Thank you for your reply. As a follow-up, I'd like to confirm: what is the purpose of setting NVSHMEM_DISABLE_P2P=1 in the low_latency scenario? And what is the impact of setting NVSHMEM_DISABLE_P2P=1 in internode scenarios?

Image

MoringKing avatar Mar 01 '25 06:03 MoringKing

Enabling this environment variable also works, but we disable it to ensure we are not using NVLink through NVSHMEM.
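For anyone who wants to compare both settings, here is a minimal sketch of toggling the variable for a test run. It assumes NVSHMEM reads `NVSHMEM_DISABLE_P2P` from the process environment when it initializes, so the value has to be in place before the test process starts. The helper name `set_nvshmem_p2p` is hypothetical and not part of DeepEP.

```python
import os


def set_nvshmem_p2p(disabled: bool, env=None) -> dict:
    """Return a copy of the environment with NVSHMEM_DISABLE_P2P set.

    Hypothetical helper for illustration only; it does not call DeepEP
    or NVSHMEM, it just prepares the environment for a child process.
    """
    # Copy so the caller's environment is left untouched.
    new_env = dict(os.environ if env is None else env)
    # "1" disables NVSHMEM's P2P (NVLink) transport, "0" leaves it enabled.
    new_env["NVSHMEM_DISABLE_P2P"] = "1" if disabled else "0"
    return new_env


if __name__ == "__main__":
    env = set_nvshmem_p2p(disabled=True)
    print("NVSHMEM_DISABLE_P2P =", env["NVSHMEM_DISABLE_P2P"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the test script. The exact launch command depends on your cluster setup and is not shown here.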

sphish avatar Mar 03 '25 10:03 sphish

> Enabling this environment variable also works, but we disable it to ensure we are not using NVLink through NVSHMEM.

But when I set NVSHMEM_DISABLE_P2P to 0 and run the low-latency test on two nodes (16 GPUs), I get the following error:

Image

GitAlice123 avatar Mar 11 '25 12:03 GitAlice123

> Enabling this environment variable also works, but we disable it to ensure we are not using NVLink through NVSHMEM.
>
> But when I set NVSHMEM_DISABLE_P2P to 0 and run the low-latency test on two nodes (16 GPUs), I get the following error:
>
> Image

It appears that the program is failing during NVSHMEM's bootstrap phase, which is unexpected. I'm not sure why this is happening.
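One way to gather more information about a bootstrap failure is to rerun with NVSHMEM's debug logging turned on. The sketch below only prepares the environment. It assumes the `NVSHMEM_DEBUG` variable is honored by the installed NVSHMEM build, and the helper name `nvshmem_debug_env` is hypothetical. It is a starting point for diagnosis, not a fix.

```python
import os


def nvshmem_debug_env(p2p_disabled: bool, level: str = "INFO", base=None) -> dict:
    """Build an environment for a diagnostic rerun (hypothetical helper).

    Sets NVSHMEM_DISABLE_P2P as requested and enables NVSHMEM debug
    messages through NVSHMEM_DEBUG. Nothing is launched here.
    """
    # Copy so the caller's environment is left untouched.
    env = dict(os.environ if base is None else base)
    env["NVSHMEM_DISABLE_P2P"] = "1" if p2p_disabled else "0"
    # Assumption: the NVSHMEM build in use reads NVSHMEM_DEBUG at startup.
    env["NVSHMEM_DEBUG"] = level
    return env


if __name__ == "__main__":
    env = nvshmem_debug_env(p2p_disabled=False)
    for key in ("NVSHMEM_DISABLE_P2P", "NVSHMEM_DEBUG"):
        print(key, "=", env[key])
```

Running the failing configuration (P2P enabled) with this environment on every node and comparing the logs against a run with P2P disabled may show where the bootstrap diverges.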

sphish avatar Mar 17 '25 02:03 sphish