Hanpeng
@wuyujiji In BytePS, tensor communication can be divided into the following steps:

```
"COORDINATE_REDUCE", "REDUCE", "COPYD2H", "PCIE_REDUCE",
"COORDINATE_PUSH", "PUSH", "PULL", "COPYH2D",
"COORDINATE_BROADCAST", "BROADCAST"
```

Here `REDUCE` and `BROADCAST` correspond to...
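The steps above can be sketched as an ordered pipeline that each tensor (or tensor partition) passes through. This is only an illustration: the stage names come from the list above, while the function and tensor names are hypothetical.

```python
# Ordered communication stages from the BytePS pipeline listed above.
STAGES = [
    "COORDINATE_REDUCE", "REDUCE", "COPYD2H", "PCIE_REDUCE",
    "COORDINATE_PUSH", "PUSH", "PULL", "COPYH2D",
    "COORDINATE_BROADCAST", "BROADCAST",
]

def trace(tensor_name):
    """Yield (stage, tensor) events in pipeline order (illustrative only)."""
    for stage in STAGES:
        yield stage, tensor_name

events = list(trace("grad_0"))
print(events[0])   # ('COORDINATE_REDUCE', 'grad_0')
print(events[-1])  # ('BROADCAST', 'grad_0')
```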
@wuyujiji Right. For the GPU responsible for synchronizing with the PS, the communication time is `REDUCE time + BROADCAST time + PUSH time + PULL time`; for the other GPUs,...
@wuyujiji If you want pure communication time, then these gaps should not be counted. If you want the communication time of the original large tensor, maybe you can...
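One way to exclude the gaps is to take the union of the per-partition busy intervals instead of `last end - first start`. A minimal sketch, assuming you have already extracted per-partition `(start, end)` timestamps from the trace; the function name is hypothetical:

```python
def pure_comm_time(intervals):
    """Total time covered by the union of (start, end) intervals,
    i.e. busy communication time with idle gaps excluded."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return sum(e - s for s, e in merged)

# Four sliced partitions of one large tensor, each with a PUSH+PULL interval:
slices = [(0, 4), (3, 7), (9, 12), (12, 15)]
print(pure_comm_time(slices))        # 13: the gap (7, 9) is not counted
print(slices[-1][1] - slices[0][0])  # 15: naive end - start includes the gap
```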
@wuyujiji FYI. You can refer to https://github.com/bytedance/byteps/blob/master/byteps/common/global.cc to see how `BYTEPS_PARTITION_BYTES` works
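As a rough illustration of the effect of `BYTEPS_PARTITION_BYTES`: a tensor larger than the threshold is split into multiple partitions, each at most the threshold size. The helper below is a hypothetical sketch, not BytePS code; see `global.cc` linked above for the actual logic.

```python
import math

def num_partitions(tensor_bytes, partition_bytes):
    """How many slices a tensor is split into, assuming a simple
    ceiling division by the BYTEPS_PARTITION_BYTES threshold."""
    return max(1, math.ceil(tensor_bytes / partition_bytes))

# A 100 MB gradient with a 4 MB partition size would be pushed as 25 slices:
print(num_partitions(100 * 1024 * 1024, 4 * 1024 * 1024))  # 25
```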
@wuyujiji Actually, the current profiling method can only capture correct PUSH start timestamps and PULL end timestamps. That's why the gaps between PUSHes and PULLs currently look very small. Actually, we are...
> If we regard the push time and pull time as a whole, the timeline shows that the time consumed by the fourth sliced tensor is significantly shorter than the first...
Because the one-queue method can NOT generate a correct topological sorting in some cases. For example, at time point t_0, worker A finishes op A_0, making A_1 and A_2 ready to run;...
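The failure mode can be sketched as follows: with a single FIFO ready-queue, the execution order depends on the order in which ops *become* ready, which can differ across workers depending on timing. This is a hypothetical illustration with made-up op IDs, not BytePS scheduler code:

```python
from collections import deque

def fifo_order(ready_events):
    """Execution order produced by one FIFO ready-queue.
    ready_events is a list of batches; each batch contains the ops
    that become ready at the same time point."""
    q = deque()
    order = []
    for batch in ready_events:
        q.extend(batch)
        order.append(q.popleft())  # scheduler pops whatever is at the front
    while q:
        order.append(q.popleft())
    return order

# Worker A: ops 1 and 2 become ready first, then op 3.
# Worker B: op 3 becomes ready first, then ops 1 and 2.
worker_a = fifo_order([[1, 2], [3]])
worker_b = fifo_order([[3], [1, 2]])
print(worker_a, worker_b)  # [1, 2, 3] vs [3, 1, 2] -- orders diverge
```

Because the two workers end up executing (and thus pushing) the same ops in different orders, the single queue cannot guarantee a consistent global ordering.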