disco
dynamic networks / fault-tolerant training algorithms
The training algorithm should support realistic changes to the communication graph, such as node failures or nodes going offline for a while. This issue only considers non-malicious nodes; Byzantine nodes will be discussed later in separate issues.
We can experiment with candidate algorithms, for example from the following papers, and test them on the simulator.
- A Unified Theory of Decentralized SGD with Changing Topology and Local Updates (https://arxiv.org/pdf/2003.10422)
- SwarmSGD: Scalable Decentralized SGD with Local Updates (https://arxiv.org/abs/1910.12308)
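
To make the setting concrete, here is a minimal sketch of the kind of experiment this could start from. It is not taken from either paper and is not the simulator's API. It assumes a toy scalar model per node, a hypothetical quadratic local loss `0.5 * (x - target_i)^2`, a fully connected graph, and non-malicious nodes that are independently offline in each round with probability `p_offline`. Online nodes take one local gradient step and then average pairwise with a random online peer. The function name `simulate` and all parameters are made up for illustration.

```python
import random


def simulate(num_nodes=8, rounds=2000, lr=0.01, p_offline=0.3, seed=0):
    """Toy decentralized SGD where nodes randomly drop out each round."""
    rng = random.Random(seed)
    # Assumption: node i holds the local loss 0.5 * (x - targets[i])^2,
    # so the global optimum is the mean of the targets.
    targets = [float(i) for i in range(num_nodes)]
    models = [0.0] * num_nodes

    for _ in range(rounds):
        # The communication graph changes every round: offline nodes
        # neither train nor communicate, and keep their last model.
        online = [i for i in range(num_nodes) if rng.random() >= p_offline]

        # Local update: one gradient step on each online node.
        for i in online:
            grad = models[i] - targets[i]
            models[i] -= lr * grad

        # Pairwise gossip: shuffle the online nodes and average them in
        # disjoint pairs. With an odd count, one node stays unpaired.
        rng.shuffle(online)
        for a, b in zip(online[0::2], online[1::2]):
            avg = 0.5 * (models[a] + models[b])
            models[a] = avg
            models[b] = avg

    return models, targets


if __name__ == "__main__":
    models, targets = simulate()
    print("global optimum:", sum(targets) / len(targets))
    print("node models:  ", [round(m, 3) for m in models])
```

Pairwise averaging is used here because it keeps the sum of the models unchanged no matter which nodes are missing, so a dropout only slows mixing down. A real comparison would replace the toy loss with actual training and the random pairing with the candidate algorithms from the papers above.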