About num_qps_per_rank
I notice that when Buffer is initialized, all modes (internode, intranode, low-latency) pass the parameter 'num_qps_per_rank', but it seems that only the low-latency mode actually uses it during init. Why is that?
The low-latency kernels assign a QP to every local expert for extremely low issue overhead, while the number of QPs does not matter for the performance of the normal kernels.
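For context, here is a minimal sketch of how a low-latency buffer is typically created with one QP per local expert. It assumes the constructor and the get_low_latency_rdma_size_hint helper as shown in the README example; exact names and arguments may differ across DeepEP versions.

```python
# Sketch only: signatures follow the README example and may differ
# slightly in your DeepEP version.
import torch.distributed as dist
from deep_ep import Buffer

def make_low_latency_buffer(group: dist.ProcessGroup,
                            num_max_dispatch_tokens_per_rank: int,
                            hidden: int, num_experts: int) -> Buffer:
    # One QP per local expert, which is what the low-latency kernels exploit.
    num_qps_per_rank = num_experts // group.size()
    num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(
        num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts)
    return Buffer(group, num_nvl_bytes=0, num_rdma_bytes=num_rdma_bytes,
                  low_latency_mode=True, num_qps_per_rank=num_qps_per_rank)
```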
Is there any way to use multiple QPs in other modes (like internode)?
Yes, the normal kernels follow the NVSHMEM environment variables, e.g. the QP number is controlled by NVSHMEM_IBGDA_NUM_RC_PER_PE.
For more settings, you can refer to https://docs.nvidia.com/nvshmem/api/gen/env.html.
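As an illustration, a hedged sketch of setting that variable from Python before the first buffer is constructed (NVSHMEM reads its environment at initialization time; the value 2 is only an example and not a recommendation):

```python
import os

# Set NVSHMEM environment variables before the first Buffer is created,
# since NVSHMEM reads them at initialization time. Example value only.
os.environ['NVSHMEM_IBGDA_NUM_RC_PER_PE'] = '2'

# ... then construct the normal (internode) Buffer as usual; its QP count
# follows the NVSHMEM settings rather than num_qps_per_rank.
```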
But NVSHMEM_IBGDA_NUM_RC_PER_PE is, judging by its name, only used for IBGDA, and in other modes we do not use IBGDA. Can I still use this parameter to set the QP numbers?
For the current version of NVSHMEM, each transport is fixed to use 1 QP per PE. It is not adjustable.
Is there a PR to fix this? We need to use a dual-port RNIC in IBRC mode. This person seems to have raised the same request: https://github.com/deepseek-ai/DeepEP/issues/74#issuecomment-2735519635