
dynamic networks / fault tolerant training algorithms

Open martinjaggi opened this issue 5 years ago • 0 comments

The training algorithm should support realistic changes to the communication graph, such as node failures or offline time. This issue only considers non-malicious nodes; Byzantine nodes will be discussed later in separate issues.

We can experiment with candidate algorithms from papers such as the following, and test them on the simulator.

  • A Unified Theory of Decentralized SGD with Changing Topology and Local Updates https://arxiv.org/pdf/2003.10422

  • SwarmSGD: Scalable Decentralized SGD with Local Updates https://arxiv.org/abs/1910.12308
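To make the setting concrete, here is a minimal sketch of decentralized SGD with pairwise averaging where nodes randomly go offline each round. It is not the algorithm from either paper, only a toy in the same spirit (local gradient steps followed by pairwise model averaging). Everything in it is an assumption for illustration: the names `gossip_round` and `simulate`, the scalar quadratic loss `0.5 * (x - target)^2` per node, the independent per-round offline probability `p_offline`, and the random matching of online nodes.

```python
import random


def gossip_round(models, online, rng):
    """Average random disjoint pairs of online nodes, in place.

    Offline nodes keep their model unchanged. Pairwise averaging
    preserves the sum of all models, so the network mean does not drift.
    """
    active = [i for i in range(len(models)) if online[i]]
    rng.shuffle(active)
    # Pair up consecutive entries; an unpaired leftover node just skips.
    for a, b in zip(active[0::2], active[1::2]):
        avg = 0.5 * (models[a] + models[b])
        models[a] = models[b] = avg


def simulate(targets, rounds=2000, lr=0.05, p_offline=0.3, seed=0):
    """Toy run: node i minimizes 0.5 * (x - targets[i])**2 locally.

    The minimizer of the summed loss is the mean of the targets.
    Each round, every node is offline independently with probability
    p_offline (hypothetical failure model, non-malicious nodes only).
    """
    rng = random.Random(seed)
    models = [0.0] * len(targets)
    for _ in range(rounds):
        online = [rng.random() >= p_offline for _ in targets]
        # Local gradient step, only on nodes that are currently online.
        for i, t in enumerate(targets):
            if online[i]:
                models[i] -= lr * (models[i] - t)
        # Communication step over this round's (changing) graph.
        gossip_round(models, online, rng)
    return models


if __name__ == "__main__":
    targets = [float(i) for i in range(8)]
    final = simulate(targets)
    print("network average:", sum(final) / len(final))
    print("optimum        :", sum(targets) / len(targets))
```

Swapping `gossip_round` for a different mixing step, or changing how `online` is sampled (for example, long offline periods instead of independent coin flips), would be the natural knobs for comparing candidate algorithms on the simulator.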

martinjaggi, Jul 09 '20 22:07