nannaer
Thank you very much for your contributions to EPLB! I would like to ask you a question. Does the current main branch support "Support changing locations of experts when server...
> > Support changing locations of experts when server is running > > Sure, `--enable-eplb` > > Support changing locations of experts when server is running > > Sure, `--enable-eplb`...
> start from EPLBManager and the logic is pretty easy to read

Thanks!
> @tianhaoz95 Hi > > > since only redundant experts change during the rebalance > > almost all (at least most) experts change indeed Hi expert, take DeepSeek V3 as...
I tried to find where the synchronization is implemented by looking at the code, but I still don't fully understand. Your guidance would be of great help to me! Thanks...
> > Is synchronization across all ranks needed before dispatching SEND/RECV operations? > > Is synchronization across all ranks needed after dispatching SEND/RECV operations? > > Is synchronization across all...
> > What changes will occur in the end-to-end latency of each RANK? Can it be estimated as max(Dispatch latency) + Expert Group Gemm latency + max(Combine latency)? > >...
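The estimate posed in the question above can be written out as a small calculation. This is only a sketch of the formula as asked, not a confirmed latency model for DeepEP; the function name `estimate_rank_latency` and all timing values are hypothetical.

```python
from typing import Sequence


def estimate_rank_latency(
    dispatch_us: Sequence[float],
    group_gemm_us: float,
    combine_us: Sequence[float],
) -> float:
    """Formula from the question: max(dispatch) + group GEMM + max(combine)."""
    if not dispatch_us or not combine_us:
        raise ValueError("need at least one rank's timing for dispatch and combine")
    return max(dispatch_us) + group_gemm_us + max(combine_us)


# Made-up per-rank timings in microseconds, for illustration only.
dispatch = [110.0, 95.0, 140.0, 102.0]
combine = [120.0, 180.0, 115.0, 130.0]
estimate = estimate_rank_latency(dispatch, 250.0, combine)
print(estimate)
```

Under this formula the slowest rank in each communication phase sets the total, so a single straggler in dispatch or combine raises the estimate for every rank.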
> For example, the only wait-data-arrival of dispatch is here: https://github.com/deepseek-ai/DeepEP/blob/main/csrc/kernels/internode_ll.cu#L492.

How does a RANK know how many inputs it should receive from other RANKs? Does this require an operation...
> [DeepEP/deep_ep/buffer.py](https://github.com/deepseek-ai/DeepEP/blob/483f00af8490b0cc378823c6adecf9ea67602071/deep_ep/buffer.py#L84)
>
> Line 84 in [483f00a](/deepseek-ai/DeepEP/commit/483f00af8490b0cc378823c6adecf9ea67602071)
>
> `os.environ['NVSHMEM_QP_DEPTH'] = '1024'`
>
> Can you try setting this to a larger number, like 4096?

Thanks!
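A minimal sketch of the suggested change, assuming it is applied in the same Python process before NVSHMEM is initialized. The helper `set_nvshmem_qp_depth` is hypothetical and not part of DeepEP. Because the quoted line assigns `'1024'` directly, a value set elsewhere would presumably be overwritten when that line runs, so the edit most likely has to be made at that line itself.

```python
import os


def set_nvshmem_qp_depth(depth: int) -> str:
    """Hypothetical helper: set NVSHMEM_QP_DEPTH for this process and return it."""
    if depth <= 0:
        raise ValueError("depth must be a positive integer")
    # Environment variable values must be strings.
    os.environ["NVSHMEM_QP_DEPTH"] = str(depth)
    return os.environ["NVSHMEM_QP_DEPTH"]


# The value suggested in the quoted comment.
current = set_nvshmem_qp_depth(4096)
print(current)
```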
> [DeepEP/deep_ep/buffer.py](https://github.com/deepseek-ai/DeepEP/blob/483f00af8490b0cc378823c6adecf9ea67602071/deep_ep/buffer.py#L84)
>
> Line 84 in [483f00a](/deepseek-ai/DeepEP/commit/483f00af8490b0cc378823c6adecf9ea67602071)
>
> `os.environ['NVSHMEM_QP_DEPTH'] = '1024'`
>
> Can you try setting this to a larger number, like 4096?

It still gets stuck...