Why must NUM_MAX_NVL_PEERS be 8?

Open Huixxi opened this issue 9 months ago • 4 comments

In deep_ep.hpp, can it be a smaller number? For example, I only have one node with 2 H800s to run test_low_latency.py.

Huixxi avatar Mar 06 '25 06:03 Huixxi

If you only have one node, EP2 is supported for both intranode kernels (via NVLink) and low-latency kernels (via RDMA).

For multiple nodes, each with fewer than 8 GPUs, you can change the NUM_MAX_NVL_PEERS macro to match your setup and see whether the kernels work. We may add a compilation macro for this later. Thanks for the feedback.
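
For reference, a compile-time override of this kind is usually written as a guarded default. The snippet below is only a sketch of that pattern, not DeepEP's actual code: the guard, the range check, and the `-DNUM_MAX_NVL_PEERS=4` build flag are assumptions for illustration. It only makes the value overridable; it does not make the kernels correct for other values.

```cpp
// Hypothetical sketch: keep 8 as the default, but allow a build-time override,
// e.g. by passing -DNUM_MAX_NVL_PEERS=4 to the compiler.
#ifndef NUM_MAX_NVL_PEERS
#define NUM_MAX_NVL_PEERS 8
#endif

// Assumed sanity check (not from the DeepEP source): reject values outside 1..8.
static_assert(NUM_MAX_NVL_PEERS >= 1 && NUM_MAX_NVL_PEERS <= 8,
              "NUM_MAX_NVL_PEERS must be between 1 and 8");
```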

LyricZhao avatar Mar 06 '25 13:03 LyricZhao

Related issue https://github.com/deepseek-ai/DeepEP/issues/477

alokprasad avatar Nov 03 '25 09:11 alokprasad

@LyricZhao Hitting this assert when switching NUM_MAX_NVL_PEERS to 4:

DeepEP/csrc/kernels/internode.cu(295): error: static assertion failed with "Invalid number of NVL peers"
                    static_assert(4 * sizeof(bool) == sizeof(uint64_t), "Invalid number of NVL peers");
                    ^

DeepEP/csrc/kernels/internode.cu(507): error: static assertion failed with "Invalid number of NVL peers"
        static_assert(4 * sizeof(bool) == sizeof(uint64_t), "Invalid number of NVL peers");
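
The assertion text suggests (this is an inference from the message, not something confirmed in this thread) that the kernel treats the per-peer bool flags as one uint64_t, which only lines up when there are exactly 8 one-byte flags. A minimal host-side sketch of that idea, generalized to other peer counts, could look like the following. `PackedFlags` and `load_flags` are hypothetical names and are not part of DeepEP.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: an unsigned integer exactly as wide as N bool flags.
// Only N = 1, 2, 4, 8 are defined; any other N fails to compile.
template <int N> struct PackedFlags;
template <> struct PackedFlags<1> { using type = std::uint8_t; };
template <> struct PackedFlags<2> { using type = std::uint16_t; };
template <> struct PackedFlags<4> { using type = std::uint32_t; };
template <> struct PackedFlags<8> { using type = std::uint64_t; };

// Read N consecutive bool flags as a single integer of matching width.
template <int N>
typename PackedFlags<N>::type load_flags(const bool* flags) {
    using T = typename PackedFlags<N>::type;
    // Same shape as the failing check, but against a width chosen from N.
    static_assert(N * sizeof(bool) == sizeof(T), "Invalid number of NVL peers");
    assert(flags != nullptr);
    T packed;
    std::memcpy(&packed, flags, sizeof(T));  // memcpy avoids aliasing issues on the host
    return packed;
}
```

With 4 peers, 4 * sizeof(bool) is 4 bytes, so comparing against sizeof(uint64_t) fails, while a 32-bit type of the same shape would pass. Whether the rest of the kernel tolerates a narrower type is a separate question.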

elvircrn avatar Nov 05 '25 19:11 elvircrn

Encountered the same issue. When I try to work around it by commenting out the assertions, I get a new error:

/miniconda3/envs/sglang/lib/python3.10/site-packages/deep_ep/buffer.py", line 135, in __init__
    self.runtime.sync(device_ids, ipc_handles, root_unique_id)
RuntimeError: Failed: CUDA error /home/annali/sglang/DeepEP/csrc/deep_ep.cpp:113 'invalid resource handle'

I'm on the latest branch of DeepEP with NUM_MAX_NVL_PEERS=2

annali07 avatar Nov 10 '25 20:11 annali07