nannaer

30 comments by nannaer

Thank you very much for your contributions to EPLB! I would like to ask you a question. Does the current main branch support "Support changing locations of experts when server...

> > Support changing locations of experts when server is running
>
> Sure, `--enable-eplb`...

> start from EPLBManager and the logic is pretty easy to read

Thanks!

> @tianhaoz95 Hi
>
> > since only redundant experts change during the rebalance
>
> almost all (at least most) experts change indeed

Hi expert, take DeepSeek V3 as...

I tried to find where the synchronization is implemented by looking at the code, but I still don't fully understand. Your guidance would be of great help to me! Thanks...

> > Is synchronization across all ranks needed before dispatching SEND/RECV operations?
> > Is synchronization across all ranks needed after dispatching SEND/RECV operations?
> > Is synchronization across all...

> > What changes will occur in the end-to-end latency of each RANK? Can it be estimated as max(Dispatch latency) + Expert Group Gemm latency + max(Combine latency)? > >...
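To make the estimate asked about above concrete, here is a minimal sketch that evaluates `max(Dispatch latency) + Expert Group Gemm latency + max(Combine latency)` for a handful of ranks. It only illustrates the formula as proposed in the question; the thread does not confirm that this is how the end-to-end latency actually behaves. The function name `estimate_step_latency` and all numbers are hypothetical.

```python
from typing import Sequence


def estimate_step_latency(
    dispatch_us: Sequence[float],
    gemm_us: float,
    combine_us: Sequence[float],
) -> float:
    """Evaluate the estimate proposed in the question (not a confirmed model).

    dispatch_us / combine_us hold one latency per rank, in microseconds;
    the slowest rank is assumed to bound each communication phase.
    """
    return max(dispatch_us) + gemm_us + max(combine_us)


# Made-up per-rank latencies in microseconds, for illustration only.
dispatch_us = [110.0, 140.0, 125.0, 118.0]
combine_us = [210.0, 190.0, 230.0, 205.0]
gemm_us = 300.0

estimate = estimate_step_latency(dispatch_us, gemm_us, combine_us)
print(f"estimated step latency: {estimate} us")
```

Comparing a number like this against per-rank measurements would be one way to check whether the slowest rank really dominates both communication phases.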

> For example, the only wait-data-arrival of dispatch is here: https://github.com/deepseek-ai/DeepEP/blob/main/csrc/kernels/internode_ll.cu#L492.

How does a RANK know how many inputs it should receive from other RANKs? Does this require an operation...

> [DeepEP/deep_ep/buffer.py](https://github.com/deepseek-ai/DeepEP/blob/483f00af8490b0cc378823c6adecf9ea67602071/deep_ep/buffer.py#L84)
>
> Line 84 in [483f00a](/deepseek-ai/DeepEP/commit/483f00af8490b0cc378823c6adecf9ea67602071)
>
> `os.environ['NVSHMEM_QP_DEPTH'] = '1024'`
>
> Can you try setting this to a larger number, like 4096?

Thanks!

> [DeepEP/deep_ep/buffer.py](https://github.com/deepseek-ai/DeepEP/blob/483f00af8490b0cc378823c6adecf9ea67602071/deep_ep/buffer.py#L84)
>
> Line 84 in [483f00a](/deepseek-ai/DeepEP/commit/483f00af8490b0cc378823c6adecf9ea67602071)
>
> `os.environ['NVSHMEM_QP_DEPTH'] = '1024'`
>
> Can you try setting this to a larger number, like 4096?

It still gets stuck...