How to use test_low_latency to profile the latency with different batch sizes on different RANKs?
As shown in the figure, after assigning a different num_tokens to each RANK and running test_low_latency.py, the process gets stuck. What methods can be used to profile the impact of different per-RANK batch sizes on latency?
https://github.com/deepseek-ai/DeepEP/blob/483f00af8490b0cc378823c6adecf9ea67602071/deep_ep/buffer.py#L84
os.environ['NVSHMEM_QP_DEPTH'] = '1024'
Can you try setting this to a larger number, like 4096?
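For clarity, the change being suggested is presumably just bumping the value on that linked line of deep_ep/buffer.py (a sketch of the edit, not a verified fix):

```python
# deep_ep/buffer.py, the line linked above -- raise the NVSHMEM queue-pair depth
os.environ['NVSHMEM_QP_DEPTH'] = '4096'   # was '1024'
```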
thanks!
It still gets stuck even after this setting. I used CUDA_LAUNCH_BLOCKING=1 to see where it gets stuck, as follows:
I have solved this~
The result of running test_low_latency.py is shown below. How can I judge the impact of communication imbalance from this data? I set the num_tokens of the first machine (RANK0-RANK7) to 256, while setting the num_tokens of the other three machines (RANK8-RANK31) to 64. This data from the second machine puzzles me: Dispatch bandwidth: 5.91 GB/s, avg_t=629.40 us. The first machine has the highest num_tokens and the largest amount of data to send, so why does it have such a low latency?
Machine 1
[rank 3] Dispatch + combine bandwidth: 48.01 GB/s, avg_t=922.98 us, min_t=868.06 us, max_t=933.60 us
[rank 4] Dispatch + combine bandwidth: 48.03 GB/s, avg_t=922.68 us, min_t=872.64 us, max_t=938.62 us
[rank 5] Dispatch + combine bandwidth: 48.01 GB/s, avg_t=923.03 us, min_t=883.58 us, max_t=945.31 us
[rank 7] Dispatch + combine bandwidth: 48.03 GB/s, avg_t=922.60 us, min_t=854.78 us, max_t=934.88 us
[rank 2] Dispatch + combine bandwidth: 48.01 GB/s, avg_t=922.95 us, min_t=870.50 us, max_t=935.07 us
[rank 0] Dispatch + combine bandwidth: 48.02 GB/s, avg_t=922.88 us, min_t=889.44 us, max_t=954.91 us
[rank 1] Dispatch + combine bandwidth: 48.03 GB/s, avg_t=922.71 us, min_t=878.59 us, max_t=943.55 us
[rank 6] Dispatch + combine bandwidth: 48.04 GB/s, avg_t=922.48 us, min_t=856.80 us, max_t=931.94 us
[rank 7] Dispatch bandwidth: 208.34 GB/s, avg_t=72.47 us | Combine bandwidth: 37.45 GB/s, avg_t=780.09 us
[rank 3] Dispatch bandwidth: 210.06 GB/s, avg_t=71.87 us | Combine bandwidth: 37.42 GB/s, avg_t=780.87 us
[rank 2] Dispatch bandwidth: 209.64 GB/s, avg_t=72.02 us | Combine bandwidth: 37.46 GB/s, avg_t=779.87 us
[rank 1] Dispatch bandwidth: 233.36 GB/s, avg_t=64.70 us | Combine bandwidth: 37.08 GB/s, avg_t=787.88 us
[rank 4] Dispatch bandwidth: 265.61 GB/s, avg_t=56.84 us | Combine bandwidth: 36.71 GB/s, avg_t=795.90 us
[rank 0] Dispatch bandwidth: 256.49 GB/s, avg_t=58.86 us | Combine bandwidth: 36.81 GB/s, avg_t=793.80 us
[rank 5] Dispatch bandwidth: 221.35 GB/s, avg_t=68.21 us | Combine bandwidth: 37.24 GB/s, avg_t=784.46 us
[rank 6] Dispatch bandwidth: 278.40 GB/s, avg_t=54.23 us | Combine bandwidth: 36.58 GB/s, avg_t=798.71 us
[rank 1] Dispatch send/recv time: 49.49 us | Combine send/recv time: 67.23 us
[rank 5] Dispatch send/recv time: 49.53 us | Combine send/recv time: 71.31 us
[rank 2] Dispatch send/recv time: 51.29 us | Combine send/recv time: 71.00 us
[rank 3] Dispatch send/recv time: 51.21 us | Combine send/recv time: 71.27 us
[rank 7] Dispatch send/recv time: 52.86 us | Combine send/recv time: 72.19 us
[rank 4] Dispatch send/recv time: 49.47 us | Combine send/recv time: 74.48 us
[rank 0] Dispatch send/recv time: 48.11 us | Combine send/recv time: 73.28 us
[rank 6] Dispatch send/recv time: 49.99 us | Combine send/recv time: 70.17 us
Machine 2
[rank 15] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=924.27 us, min_t=884.29 us, max_t=949.86 us
[rank 13] Dispatch + combine bandwidth: 11.80 GB/s, avg_t=925.17 us, min_t=903.30 us, max_t=952.16 us
[rank 9] Dispatch + combine bandwidth: 11.80 GB/s, avg_t=925.33 us, min_t=903.49 us, max_t=938.34 us
[rank 14] Dispatch + combine bandwidth: 11.80 GB/s, avg_t=924.95 us, min_t=887.97 us, max_t=959.39 us
[rank 8] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=924.51 us, min_t=900.77 us, max_t=948.86 us
[rank 10] Dispatch + combine bandwidth: 11.79 GB/s, avg_t=925.57 us, min_t=905.06 us, max_t=944.29 us
[rank 11] Dispatch + combine bandwidth: 11.80 GB/s, avg_t=924.71 us, min_t=892.83 us, max_t=957.22 us
[rank 12] Dispatch + combine bandwidth: 11.80 GB/s, avg_t=925.21 us, min_t=902.56 us, max_t=948.64 us
[rank 15] Dispatch bandwidth: 5.97 GB/s, avg_t=622.84 us | Combine bandwidth: 32.70 GB/s, avg_t=220.05 us
[rank 8] Dispatch bandwidth: 5.91 GB/s, avg_t=629.20 us | Combine bandwidth: 33.29 GB/s, avg_t=216.15 us
[rank 9] Dispatch bandwidth: 5.97 GB/s, avg_t=622.41 us | Combine bandwidth: 32.50 GB/s, avg_t=221.45 us
[rank 12] Dispatch bandwidth: 5.88 GB/s, avg_t=632.57 us | Combine bandwidth: 34.26 GB/s, avg_t=210.04 us
[rank 11] Dispatch bandwidth: 5.99 GB/s, avg_t=621.00 us | Combine bandwidth: 32.43 GB/s, avg_t=221.94 us
[rank 10] Dispatch bandwidth: 5.94 GB/s, avg_t=626.25 us | Combine bandwidth: 33.29 GB/s, avg_t=216.16 us
[rank 13] Dispatch bandwidth: 6.00 GB/s, avg_t=620.22 us | Combine bandwidth: 32.32 GB/s, avg_t=222.69 us
[rank 14] Dispatch bandwidth: 6.05 GB/s, avg_t=614.58 us | Combine bandwidth: 31.54 GB/s, avg_t=228.16 us
[rank 15] Dispatch send/recv time: 31.10 us | Combine send/recv time: 31.74 us
[rank 12] Dispatch send/recv time: 29.67 us | Combine send/recv time: 30.48 us
[rank 9] Dispatch send/recv time: 31.20 us | Combine send/recv time: 30.38 us
[rank 14] Dispatch send/recv time: 30.86 us | Combine send/recv time: 32.12 us
[rank 8] Dispatch send/recv time: 32.31 us | Combine send/recv time: 30.44 us
[rank 11] Dispatch send/recv time: 30.91 us | Combine send/recv time: 29.96 us
[rank 10] Dispatch send/recv time: 29.79 us | Combine send/recv time: 32.07 us
[rank 13] Dispatch send/recv time: 29.90 us | Combine send/recv time: 31.25 us
Machine 3
[rank 17] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=924.33 us, min_t=895.74 us, max_t=949.95 us
[rank 20] Dispatch + combine bandwidth: 11.82 GB/s, avg_t=923.30 us, min_t=842.27 us, max_t=981.41 us
[rank 18] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=923.96 us, min_t=878.59 us, max_t=954.08 us
[rank 21] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=924.61 us, min_t=885.06 us, max_t=960.54 us
[rank 19] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=924.20 us, min_t=897.95 us, max_t=961.57 us
[rank 23] Dispatch + combine bandwidth: 11.80 GB/s, avg_t=925.08 us, min_t=908.13 us, max_t=947.81 us
[rank 22] Dispatch + combine bandwidth: 11.80 GB/s, avg_t=925.04 us, min_t=898.37 us, max_t=962.91 us
[rank 16] Dispatch + combine bandwidth: 11.84 GB/s, avg_t=923.93 us, min_t=895.74 us, max_t=947.62 us
[rank 21] Dispatch bandwidth: 5.91 GB/s, avg_t=629.40 us | Combine bandwidth: 32.51 GB/s, avg_t=221.39 us
[rank 22] Dispatch bandwidth: 5.90 GB/s, avg_t=630.74 us | Combine bandwidth: 32.71 GB/s, avg_t=220.01 us
[rank 19] Dispatch bandwidth: 5.93 GB/s, avg_t=626.89 us | Combine bandwidth: 32.08 GB/s, avg_t=224.35 us
[rank 18] Dispatch bandwidth: 6.02 GB/s, avg_t=617.62 us | Combine bandwidth: 30.95 GB/s, avg_t=232.56 us
[rank 16] Dispatch bandwidth: 5.94 GB/s, avg_t=626.98 us | Combine bandwidth: 32.24 GB/s, avg_t=223.66 us
[rank 17] Dispatch bandwidth: 5.99 GB/s, avg_t=621.34 us | Combine bandwidth: 31.47 GB/s, avg_t=228.67 us
[rank 20] Dispatch bandwidth: 6.14 GB/s, avg_t=606.08 us | Combine bandwidth: 29.43 GB/s, avg_t=244.54 us
[rank 23] Dispatch bandwidth: 6.06 GB/s, avg_t=613.91 us | Combine bandwidth: 30.49 GB/s, avg_t=236.05 us
[rank 19] Dispatch send/recv time: 30.00 us | Combine send/recv time: 29.63 us
[rank 18] Dispatch send/recv time: 29.11 us | Combine send/recv time: 29.69 us
[rank 16] Dispatch send/recv time: 32.54 us | Combine send/recv time: 29.11 us
[rank 20] Dispatch send/recv time: 29.63 us | Combine send/recv time: 28.86 us
[rank 22] Dispatch send/recv time: 29.93 us | Combine send/recv time: 27.90 us
[rank 21] Dispatch send/recv time: 29.09 us | Combine send/recv time: 29.94 us
[rank 17] Dispatch send/recv time: 31.20 us | Combine send/recv time: 30.53 us
[rank 23] Dispatch send/recv time: 29.40 us | Combine send/recv time: 29.76 us
Machine 4
[rank 30] Dispatch + combine bandwidth: 11.83 GB/s, avg_t=924.69 us, min_t=888.86 us, max_t=955.97 us
[rank 26] Dispatch + combine bandwidth: 11.83 GB/s, avg_t=924.46 us, min_t=892.93 us, max_t=942.66 us
[rank 28] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=923.98 us, min_t=880.38 us, max_t=965.50 us
[rank 29] Dispatch + combine bandwidth: 11.79 GB/s, avg_t=925.50 us, min_t=878.05 us, max_t=971.10 us
[rank 27] Dispatch + combine bandwidth: 11.80 GB/s, avg_t=925.38 us, min_t=869.34 us, max_t=963.65 us
[rank 24] Dispatch + combine bandwidth: 11.79 GB/s, avg_t=925.49 us, min_t=882.14 us, max_t=964.99 us
[rank 25] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=924.61 us, min_t=898.34 us, max_t=956.80 us
[rank 31] Dispatch + combine bandwidth: 11.81 GB/s, avg_t=923.92 us, min_t=908.48 us, max_t=949.73 us
[rank 27] Dispatch bandwidth: 6.12 GB/s, avg_t=607.51 us | Combine bandwidth: 30.74 GB/s, avg_t=234.09 us
[rank 24] Dispatch bandwidth: 5.95 GB/s, avg_t=624.99 us | Combine bandwidth: 32.76 GB/s, avg_t=219.69 us
[rank 28] Dispatch bandwidth: 6.05 GB/s, avg_t=614.50 us | Combine bandwidth: 31.71 GB/s, avg_t=226.95 us
[rank 25] Dispatch bandwidth: 5.91 GB/s, avg_t=629.47 us | Combine bandwidth: 33.77 GB/s, avg_t=213.11 us
[rank 29] Dispatch bandwidth: 6.04 GB/s, avg_t=616.03 us | Combine bandwidth: 31.91 GB/s, avg_t=225.54 us
[rank 26] Dispatch bandwidth: 5.94 GB/s, avg_t=627.61 us | Combine bandwidth: 33.70 GB/s, avg_t=213.99 us
[rank 30] Dispatch bandwidth: 5.98 GB/s, avg_t=623.24 us | Combine bandwidth: 32.99 GB/s, avg_t=218.60 us
[rank 31] Dispatch bandwidth: 6.11 GB/s, avg_t=608.46 us | Combine bandwidth: 30.86 GB/s, avg_t=233.22 us
[rank 27] Dispatch send/recv time: 28.63 us | Combine send/recv time: 29.32 us
[rank 26] Dispatch send/recv time: 30.38 us | Combine send/recv time: 28.56 us
[rank 30] Dispatch send/recv time: 29.99 us | Combine send/recv time: 29.05 us
[rank 24] Dispatch send/recv time: 32.35 us | Combine send/recv time: 30.39 us
[rank 31] Dispatch send/recv time: 28.86 us | Combine send/recv time: 29.91 us
[rank 28] Dispatch send/recv time: 29.40 us | Combine send/recv time: 30.20 us
[rank 29] Dispatch send/recv time: 29.56 us | Combine send/recv time: 28.67 us
[rank 25] Dispatch send/recv time: 29.20 us | Combine send/recv time: 29.04 us
I have solved this~
How did you resolve the issue?
The first machine has the highest number of num_tokens and the largest amount of data to send, but why does it have such a low latency?
Since the time spent submitting send requests accounts for only a small portion, the main source of dispatch latency here comes from the time spent receiving data. The amount of data to be received is small, so the latency is also low.
I still don't quite understand. Could you please explain it in more detail? Thank you very much for your patient guidance. Across the four machines, I set the num_tokens of machine 1 (RANK0-RANK7) to 256, while setting the num_tokens of the other three machines (RANK8-RANK31, machines 2-4) to 64. With this setting, the dispatch latency and combine latency of RANK0-RANK7 should in theory be significantly higher than that of RANK8-RANK31. From the logs, the dispatch + combine bandwidth and the dispatch send/recv time are consistent with that expectation, but the latency measured with SMs (Streaming Multiprocessors) is not. For example, machine 1 shows Dispatch bandwidth: 208.34 GB/s, avg_t=72.47 us | Combine bandwidth: 37.45 GB/s, avg_t=780.09 us, while machine 2 shows Dispatch bandwidth: 5.97 GB/s, avg_t=622.84 us | Combine bandwidth: 32.70 GB/s, avg_t=220.05 us.
How did you resolve the issue?
This is because in the low_latency_dispatch function, there is a parameter called num_max_dispatch_tokens_per_rank, which needs to be the same for all ranks. However, in the test_low_latency.py script, this parameter is set to the number of tokens for each rank. As a result, when profiling with an uneven distribution of token numbers across different ranks, the process would get stuck. Now, I have set this parameter to a larger value for every rank, such as 512.
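For reference, a minimal sketch of the fix described above (the call shape loosely follows test_low_latency.py and the README example and may differ from the real API; buffer, rank, hidden, num_topk, and num_experts are assumed to be set up elsewhere):

```python
import torch

num_max_dispatch_tokens_per_rank = 512        # must be the same constant on every rank
num_tokens = 256 if rank < 8 else 64          # the actual per-rank batch may differ

x = torch.randn(num_tokens, hidden, dtype=torch.bfloat16, device='cuda')
topk_idx = torch.randint(0, num_experts, (num_tokens, num_topk),
                         dtype=torch.int64, device='cuda')

# Passing per-rank num_tokens as the capacity (as the unmodified test does) is what
# caused the hang with uneven batches; all ranks must agree on the capacity argument.
recv_x, recv_count, handle, event, hook = buffer.low_latency_dispatch(
    x, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
    return_recv_hook=False)
```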
I still don't quite understand. Could you please explain it in more detail?
In your test settings, machine 1 has more send tokens than the other machines, which also means that machine 1 has fewer receive tokens compared to the others. Therefore, the time it takes for machine 1 to receive tokens is shorter than that of the other machines.
Note: The shown bandwidth is not a concern here, because the bandwidth calculation assumes that the number of send tokens is equal to the number of receive tokens.
Thank you for your detailed explanation! I have the following questions that I don't understand:
(1) Since the network is full-duplex, the receiving bandwidth and transmitting bandwidth should be equal. Why does the dispatch delay mainly come from the time spent receiving data rather than sending data?
(2) "machine 1 has more send tokens than the other machines, which also means that machine 1 has fewer receive tokens compared to the others." Taking two machines as an example, where machine 1 has 100 tokens and machine 2 has 10 tokens. If machine 1 needs to receive 5 tokens and machine 2 needs to receive 50 tokens, does it mean that machine 1 has lower delay in the receive-bound scenario?
In addition to the above, I have two more questions that I don't understand:
(1) Why is the latency significantly higher when return_recv_hook is set to False compared to when it is set to True?
(2) According to the logs, when return_recv_hook is set to False, Machine 1, which has more tokens, has lower dispatch latency but higher combine latency. When return_recv_hook is set to True, Machine 1 has higher latency for both dispatch and combine. Why is this the case?
Thank you very much for your guidance! Discussing with you has been very enlightening.
Since the network is full-duplex, the receiving bandwidth and transmitting bandwidth should be equal. Why does the dispatch delay mainly come from the time spent receiving data rather than sending data?
This is because sending is asynchronous and does not need to wait for completion, whereas receiving does.
"machine 1 has more send tokens than the other machines, which also means that machine 1 has fewer receive tokens compared to the others." Taking two machines as an example, where machine 1 has 100 tokens and machine 2 has 10 tokens. If machine 1 needs to receive 5 tokens and machine 2 needs to receive 50 tokens, does it mean that machine 1 has lower delay in the receive-bound scenario?
Yes.
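A rough way to picture this (an illustrative model only, with made-up per-token costs, not DeepEP's actual accounting): the dispatch kernel's time is a small send-submission cost plus a wait that scales with the number of tokens it must receive.

```python
# Illustrative only: invented constants, just to show why the receive side dominates.
def dispatch_latency_us(send_tokens, recv_tokens,
                        submit_us_per_token=0.05, recv_us_per_token=10.0):
    # posting asynchronous sends is cheap; waiting for arrivals scales with recv_tokens
    return send_tokens * submit_us_per_token + recv_tokens * recv_us_per_token

# The two-machine example above: machine 1 sends 100 / receives 5,
# machine 2 sends 10 / receives 50 -- machine 1 ends up with the lower latency.
print(dispatch_latency_us(send_tokens=100, recv_tokens=5))    # ~55 us
print(dispatch_latency_us(send_tokens=10,  recv_tokens=50))   # ~500 us
```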
Can you please provide some logs to demonstrate the issue you encountered?
I set the num_tokens of the first machine (RANK0-RANK7) to 256, while setting the num_tokens of the other three machines (RANK8-RANK31) to 64. The Dispatch send/recv time lines correspond to return_recv_hook being True, while the Dispatch bandwidth lines correspond to return_recv_hook being False. The full logs are the same as those posted above; the data related to my question is (Machine 1 rank 7 and Machine 2 rank 15):
Machine 1
[rank 7] Dispatch bandwidth: 208.34 GB/s, avg_t=72.47 us | Combine bandwidth: 37.45 GB/s, avg_t=780.09 us
[rank 7] Dispatch send/recv time: 52.86 us | Combine send/recv time: 72.19 us
Machine 2
[rank 15] Dispatch bandwidth: 5.97 GB/s, avg_t=622.84 us | Combine bandwidth: 32.70 GB/s, avg_t=220.05 us
[rank 15] Dispatch send/recv time: 31.10 us | Combine send/recv time: 31.74 us
When return_recv_hook is set to False, the kernel will synchronously wait for the communication to complete. However, when it is set to True, the kernel will exit immediately after submitting the send request and will handle data reception after computation is finished. For more details, please refer to: https://github.com/deepseek-ai/DeepEP?tab=readme-ov-file#example-use-in-inference-decoding. Therefore, the kernel execution time measured with return_recv_hook=False will be longer.
When return_recv_hook is set to False, Machine 1 has relatively few tokens to receive in the dispatch stage and more tokens to receive in the combine stage, so it has lower dispatch latency and higher combine latency.
When return_recv_hook is set to True, the kernel execution time mainly comes from submitting send requests in the dispatch stage and accumulating the received data in the combine stage. Machine 1 has relatively more tokens to send in the dispatch stage and more tokens to accumulate in the combine stage, so it has higher latency for both dispatch and combine.
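A condensed sketch of the pattern described in the linked README section (variable names follow the earlier sketch and are assumptions, not the exact API):

```python
# return_recv_hook=True: the kernel only posts the sends and returns immediately;
# calling the returned hook later finishes the reception, so other work can run in between.
recv_x, recv_count, handle, event, hook = buffer.low_latency_dispatch(
    x, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
    return_recv_hook=True)

do_something_else()   # placeholder, e.g. another micro-batch's computation

hook()                # now actually wait for the dispatched tokens to arrive
```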
Timeline of return_recv_hook = True
Deleting the computation function used to hide communication
Thank you very much for your detailed explanation! I have checked the timeline of return_recv_hook = True. May I understand it in the following way?
(1) The dispatch latency shown in the figure refers to the time of the kernel that submits the RDMA sends, and does not include the actual SEND/RECV time, since SEND/RECV does not occupy the SMs when return_recv_hook = True. Therefore, in the log, when return_recv_hook = True, both the dispatch and combine times for Machine 1, which has the larger batch size, are longer, possibly due to longer preprocessing time.
After deleting the computation kernel used to hide communication, the time for return_recv_hook = True and return_recv_hook = False is basically the same.
(2) return_recv_hook = True is only useful when two micro-batches are enabled, so that computation can overlap with communication.
Yeah, you got it right.
Good morning! Thank you again for your detailed explanation!!! However, I still don't fully understand why, when return_recv_hook is set to False, machine 1 (the one with the larger number of tokens) has lower dispatch latency but higher combine latency. Is this due to the reason you mentioned earlier: "Since the time spent submitting send requests accounts for only a small portion, the main source of dispatch latency here comes from the time spent receiving data. The amount of data to be received is small, so the latency is also low."? Why does the time spent submitting send requests account for only a small portion?
Additionally, when return_recv_hook is set to False, the kernel will synchronously wait for the communication to complete. Does this mean that the dispatch of a RANK must wait for the slower of the two operations, dispatch SEND and dispatch RECV, to finish?
In addition, I would like to ask you another question. When return_recv_hook=False, will the RANK that completes dispatch first wait for the RANK with higher dispatch latency to finish, so that all RANKs then proceed together to execute the expert-layer computation? Where does low_latency_dispatch/combine perform synchronization across all RANKs? https://github.com/deepseek-ai/DeepEP/issues/220
Once again, I express my heartfelt thanks to you!!!
🥺🥺🥺
However, I still don't fully understand why when return_recv_hook is set to False, machine 1 (the one with a larger number of tokens) has lower dispatch latency but higher combine latency. Is this due to the reason you mentioned earlier: "Since the time spent submitting send requests accounts for only a small portion, the main source of dispatch latency here comes from the time spent receiving data. The amount of data to be received is small, so the latency is also low." Why does the time spent submitting send requests account for only a small portion?
As I mentioned above, the reason is that sending is asynchronous and does not need to wait for completion.
Additionally, when return_recv_hook is set to False, the kernel will synchronously wait for the communication to complete. Does this mean that the dispatch of a RANK must wait for the slower of the two operations, dispatch SEND and dispatch RECV, to finish?
SEND and RECV operations are serialized; you can refer to the code for details.
When return_recv_hook=False , will the RANK that completes dispatch first wait for the RANK with higher dispatch latency to finish, and then all RANKs proceed together to execute the expert layer computation?
No.
Where does low_latency_dispatch/combine perform synchronization across all RANKs?
You do not need to perform additional synchronization; you can refer to the test code.
Thank you for your serious and detailed explanation!