nannaer

Results: 8 issues by nannaer

### Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- ...

As shown in the figure, after assigning a different num_tokens parameter to each rank and running test_low_latency.py, the process gets stuck. What methods can be used to profile the impact of...
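One generic, stdlib-only way to see where a stuck Python process is blocked is `faulthandler`, which can periodically dump every thread's stack to stderr. This is a hedged sketch, not DeepEP-specific tooling; `run_test` below is a hypothetical stand-in for the body of test_low_latency.py:

```python
import faulthandler
import sys
import time

def run_test():
    # Hypothetical stand-in for the test body; it just sleeps here.
    # In the real scenario this would be the dispatch/combine loop.
    time.sleep(0.5)

# Dump all thread stacks to stderr every 5 seconds until cancelled.
# If the process hangs inside a dispatch or combine call, the dump
# shows which frame each thread is blocked in.
faulthandler.dump_traceback_later(timeout=5, repeat=True, file=sys.stderr)
try:
    run_test()
finally:
    faulthandler.cancel_dump_traceback_later()
```

For a process that is already hung, the same idea applies externally: attach a stack sampler to the live PID instead of instrumenting the script.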

I'd like to know where **_synchronization across all ranks_** is required for both dispatch and combine operations, using the following code that calls low_latency_dispatch and low_latency_combine as an example. Specifically:...
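Both calls are collectives, so every rank must enter them or the ranks that did will block. The following is a minimal stdlib analogy of that property, not DeepEP code: each "rank" is just a thread, and a `threading.Barrier` stands in for the implicit all-rank synchronization point inside a dispatch or combine call.

```python
import threading

WORLD_SIZE = 4
# The barrier models the all-rank synchronization point inside a
# collective such as a dispatch/combine call.
barrier = threading.Barrier(WORLD_SIZE)
results = {}

def rank_worker(rank, participate):
    if not participate:
        return  # this rank never enters the collective
    try:
        # Time out instead of hanging forever, so the deadlock is visible.
        barrier.wait(timeout=0.5)
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        results[rank] = "stuck"

def run(missing_rank=None):
    results.clear()
    barrier.reset()
    threads = [
        threading.Thread(target=rank_worker, args=(r, r != missing_rank))
        for r in range(WORLD_SIZE)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results)

print(run())                # all ranks enter the collective: all proceed
print(run(missing_rank=3))  # one rank skips the call: the others block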

When deploying PD-separated DeepSeek-V3, multi-machine decode works normally with two machines. However, with four machines, the following errors occur. I hope to get your help....

When using flash-attn-3 3.0.0b1 as the attention kernel in our serving engine to serve Qwen3-235B-A22B, we encounter an illegal memory access, as follows: (ModelRunner pid=871993, ip=10.102.207.116) CUDA error (/workspace/flash-attention/hopper/flash_fwd_launch_template.h:198): an illegal memory access...

Thank you very much for your contributions to the RTP-LLM inference engine! I have a question about the load balancing strategy in the technical report. In the DeepEP framework, Dispatch...

When running the two-machine test_low_latency.py (EP16), there is a significant difference in the test results of dispatch and combine on the two machines. My version number is 9fe9021, and I...

Thank you very much for your contributions to DeepEP! I have a question about latency in the per-rank batch-imbalance scenario. The above figure is my theoretical estimation...
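As a hedged sketch of one common way to reason about this (a toy model of my own, not a formula from DeepEP or the report): if each rank must finish exchanging the largest per-rank payload before the step can proceed, step latency tracks max over ranks of tokens, not the mean. The constants below are made-up illustrative numbers, not measured values.

```python
def step_latency_us(tokens_per_rank, per_token_us=0.1, fixed_us=20.0):
    """Toy model: latency = fixed overhead + cost of the busiest rank.

    per_token_us and fixed_us are illustrative constants only,
    not measured DeepEP numbers.
    """
    return fixed_us + max(tokens_per_rank) * per_token_us

balanced   = step_latency_us([128] * 8)         # every rank sends 128 tokens
imbalanced = step_latency_us([16] * 7 + [912])  # same total, one hot rank
print(balanced, imbalanced)
```

Under this model, two workloads with the same total token count can differ widely in latency: the imbalanced one is gated by its single busiest rank.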