
How to deal with the case where one or more processes are much faster than the others

Open BichengYing opened this issue 5 years ago • 1 comments

Because of the nature of one-sided communication, the progress of different processes may vary a lot, especially in a heterogeneous environment. If we simply write code like `for e in range(epochs): xxx some_collective_ops`, then the last collective op wastes the advantage of one-sided communication, because the fast processes must wait there for the slowest one. We need a better way to structure the code or otherwise deal with this situation.
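
To make the cost concrete, here is a minimal sketch that simulates the pattern with plain Python threads. It does not use the BlueFog API: `threading.Barrier` stands in for the final collective op, `time.sleep` stands in for the local epochs, and the function name `idle_time_at_final_collective` is hypothetical. It also assumes the collective op comes after the loop.

```python
import threading
import time


def idle_time_at_final_collective(work_seconds):
    """Return how long each simulated process waits at the final collective.

    work_seconds[i] is the (assumed) total time process i needs for its
    local epochs. This is a thread-based simulation, not BlueFog code.
    """
    n = len(work_seconds)
    barrier = threading.Barrier(n)  # stand-in for a blocking collective op
    idle = [0.0] * n

    def worker(rank):
        time.sleep(work_seconds[rank])   # stand-in for all local epochs
        arrived = time.perf_counter()
        barrier.wait()                   # everyone blocks until the slowest arrives
        idle[rank] = time.perf_counter() - arrived

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return idle


if __name__ == "__main__":
    # One fast and one slow process: the fast one idles for most of the run.
    print(idle_time_at_final_collective([0.01, 0.2]))
```

In this simulation the fast process spends nearly the whole run waiting at the barrier, which is the waste described above.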

BichengYing avatar Apr 14 '20 06:04 BichengYing

Thoughts:

1. Use a barrier every N iterations. This can help when performance is unstable, but not in a heterogeneous setting where some processes are consistently slower.
2. Run for a very long time and rely on early stopping: whichever node/agent first meets the stopping criterion sends a stop signal to the others, and the model of that agent is used as the final result.
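
The second idea could be sketched roughly as follows. This is again a thread-based simulation rather than BlueFog code: a shared `threading.Event` plays the role of the stop signal, reaching `steps_to_converge` local steps is a placeholder for a real stopping criterion, and `run_until_first_converges` is a hypothetical name.

```python
import threading
import time


def run_until_first_converges(step_delays, steps_to_converge, max_steps=10_000):
    """Simulate agents of different speeds; the first to converge stops all.

    step_delays[i] is the (assumed) time agent i needs for one local step.
    Returns (rank of the winning agent, steps completed by each agent).
    """
    n = len(step_delays)
    stop = threading.Event()      # shared stop signal
    lock = threading.Lock()
    result = {"winner": None}
    steps_done = [0] * n

    def agent(rank):
        for step in range(1, max_steps + 1):
            if stop.is_set():                 # another agent already finished
                return
            time.sleep(step_delays[rank])     # stand-in for one local update
            steps_done[rank] = step
            if step >= steps_to_converge:     # placeholder stopping criterion
                with lock:
                    if result["winner"] is None:
                        result["winner"] = rank   # this agent's model is the result
                stop.set()                    # tell everyone else to stop
                return

    threads = [threading.Thread(target=agent, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["winner"], steps_done


if __name__ == "__main__":
    winner, steps = run_until_first_converges([0.001, 0.05, 0.05], steps_to_converge=20)
    print(winner, steps)
```

No agent ever waits for a slower one here: slow agents simply check the signal between local steps and exit. In a real multi-process setting the signal would have to travel over the communication layer, for example as a flag that each agent writes to and polls from its neighbors, which is an assumption and not something this sketch implements.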

BichengYing avatar Apr 28 '20 03:04 BichengYing