Hanpeng
@wuyujiji In BytePS, tensor communication can be divided into the following steps:

```
"COORDINATE_REDUCE", "REDUCE", "COPYD2H", "PCIE_REDUCE",
"COORDINATE_PUSH", "PUSH", "PULL", "COPYH2D",
"COORDINATE_BROADCAST", "BROADCAST"
```

Here `REDUCE` and `BROADCAST` correspond to...
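The steps above can be sketched as an ordered pipeline that each tensor (or tensor partition) passes through. This is only an illustration: the stage names come from the list above, while the function and tensor names are hypothetical.

```python
# Ordered communication stages from the BytePS pipeline listed above.
STAGES = [
    "COORDINATE_REDUCE", "REDUCE", "COPYD2H", "PCIE_REDUCE",
    "COORDINATE_PUSH", "PUSH", "PULL", "COPYH2D",
    "COORDINATE_BROADCAST", "BROADCAST",
]

def trace(tensor_name):
    """Yield (stage, tensor) events in pipeline order (illustrative only)."""
    for stage in STAGES:
        yield stage, tensor_name

events = list(trace("grad_0"))
print(events[0])   # ('COORDINATE_REDUCE', 'grad_0')
print(events[-1])  # ('BROADCAST', 'grad_0')
```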
@wuyujiji Right. For the GPU responsible for synchronizing with the PS, the communication time is `REDUCE time + BROADCAST time + PUSH time + PULL time`; for the other GPUs,...
@wuyujiji If you want pure communication time, then these gaps should not be counted. If you want the communication time of the original large tensor, maybe you can...
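One way to exclude the gaps is to take the union of the per-partition busy intervals instead of `last end - first start`. A minimal sketch, assuming you have already extracted per-partition `(start, end)` timestamps from the trace; the function name is hypothetical:

```python
def pure_comm_time(intervals):
    """Total time covered by the union of (start, end) intervals,
    i.e. busy communication time with idle gaps excluded."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return sum(e - s for s, e in merged)

# Four sliced partitions of one large tensor, each with a PUSH+PULL interval:
slices = [(0, 4), (3, 7), (9, 12), (12, 15)]
print(pure_comm_time(slices))        # 13: the gap (7, 9) is not counted
print(slices[-1][1] - slices[0][0])  # 15: naive end - start includes the gap
```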
@wuyujiji FYI. You can refer to https://github.com/bytedance/byteps/blob/master/byteps/common/global.cc to see how `BYTEPS_PARTITION_BYTES` works
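As a rough illustration of the effect of `BYTEPS_PARTITION_BYTES`: a tensor larger than the threshold is split into multiple partitions, each at most the threshold size. The helper below is a hypothetical sketch, not BytePS code; see `global.cc` linked above for the actual logic.

```python
import math

def num_partitions(tensor_bytes, partition_bytes):
    """How many slices a tensor is split into, assuming a simple
    ceiling division by the BYTEPS_PARTITION_BYTES threshold."""
    return max(1, math.ceil(tensor_bytes / partition_bytes))

# A 100 MB gradient with a 4 MB partition size would be pushed as 25 slices:
print(num_partitions(100 * 1024 * 1024, 4 * 1024 * 1024))  # 25
```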
@wuyujiji Actually, the current profiling method can only capture correct PUSH start timestamps and PULL end timestamps. That's why the gaps between PUSHes and PULLs currently look very small. Actually, we are...
> If we regard the push time and pull time as a whole, the timeline shows that the time consumed by the fourth sliced tensor is significantly shorter than the first...
Because the one-queue method can NOT generate a correct topological sorting in some cases. For example, at time point t_0, worker A finishes op A_0, making A_1 and A_2 ready to run;...
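The failure mode can be sketched as follows: with a single FIFO ready-queue, the execution order depends on the order in which ops *become* ready, which can differ across workers depending on timing. This is a hypothetical illustration with made-up op IDs, not BytePS scheduler code:

```python
from collections import deque

def fifo_order(ready_events):
    """Execution order produced by one FIFO ready-queue.
    ready_events is a list of batches; each batch contains the ops
    that become ready at the same time point."""
    q = deque()
    order = []
    for batch in ready_events:
        q.extend(batch)
        order.append(q.popleft())  # scheduler pops whatever is at the front
    while q:
        order.append(q.popleft())
    return order

# Worker A: ops 1 and 2 become ready first, then op 3.
# Worker B: op 3 becomes ready first, then ops 1 and 2.
worker_a = fifo_order([[1, 2], [3]])
worker_b = fifo_order([[3], [1, 2]])
print(worker_a, worker_b)  # [1, 2, 3] vs [3, 1, 2] -- orders diverge
```

Because the two workers end up executing (and thus pushing) the same ops in different orders, the single queue cannot guarantee a consistent global ordering.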