About num_qps_per_rank
I notice that when Buffer is initialized, all modes (internode, intranode, low-latency) pass the parameter 'num_qps_per_rank', but it seems that only the low-latency mode actually uses it during init. Why is that?
The low-latency kernels assign a QP to every local expert for extremely low issue overhead, while the number of QPs does not matter for the performance of the normal kernels.
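For context, here is a minimal sketch of how a low-latency buffer is typically created with one QP per local expert. It assumes the constructor and the get_low_latency_rdma_size_hint helper as shown in the README example; exact names and arguments may differ across DeepEP versions.

```python
# Sketch only: signatures follow the README example and may differ
# slightly in your DeepEP version.
import torch.distributed as dist
from deep_ep import Buffer

def make_low_latency_buffer(group: dist.ProcessGroup,
                            num_max_dispatch_tokens_per_rank: int,
                            hidden: int, num_experts: int) -> Buffer:
    # One QP per local expert, which is what the low-latency kernels exploit.
    num_qps_per_rank = num_experts // group.size()
    num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(
        num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts)
    return Buffer(group, num_nvl_bytes=0, num_rdma_bytes=num_rdma_bytes,
                  low_latency_mode=True, num_qps_per_rank=num_qps_per_rank)
```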
Is there any way to use multiple QPs in other modes (like internode)?
Yes, the normal kernels follow the NVSHMEM environment variables, e.g. the QP number is controlled by NVSHMEM_IBGDA_NUM_RC_PER_PE.
For more settings, you can refer to https://docs.nvidia.com/nvshmem/api/gen/env.html.
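As an illustration, a hedged sketch of setting that variable from Python before the first buffer is constructed (NVSHMEM reads its environment at initialization time; the value 2 is only an example and not a recommendation):

```python
import os

# Set NVSHMEM environment variables before the first Buffer is created,
# since NVSHMEM reads them at initialization time. Example value only.
os.environ['NVSHMEM_IBGDA_NUM_RC_PER_PE'] = '2'

# ... then construct the normal (internode) Buffer as usual; its QP count
# follows the NVSHMEM settings rather than num_qps_per_rank.
```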
But NVSHMEM_IBGDA_NUM_RC_PER_PE is, judging by its name, only used for IBGDA, and in other modes we do not use IBGDA. Can I still use this parameter to set the QP numbers?
For the current version of NVSHMEM, each transport is fixed to use 1 QP per PE. It is not adjustable.
Is there a PR to fix this? We need to use a dual-port RNIC in IBRC mode. This person seems to have raised the same request: https://github.com/deepseek-ai/DeepEP/issues/74#issuecomment-2735519635