Decentralized Learning Implementation Fails to Add New Peers

Open aunell opened this issue 2 years ago • 2 comments

EDIT: The connection problem occurs at a later stage. It is slightly different from what we thought. More refined tests in /server/tests/end_to_end_decentralized.ts show that:

  1. After 2 insecureDecentralized clients (A and B) connect to the same task, a 2-way peer-to-peer connection between them is established (presumably successfully). At least, they each have a SimplePeer instance (called peer) which corresponds to the other client, and peer.connected is true for both of them.
  2. Up to the first round of communication, both clients train in parallel. Each trainerlogger message is printed twice in a row, with the exact same accuracy and loss (each client prints it once, and both are doing the exact same thing).
  3. However, during the first round of communication, one client's peer has peer.connected set to false! Specifically, if client A initiated the connection, then client B's peer for A has peer.connected set to false.
  4. Consequently, no averaging of weights takes place.
  5. Subsequently, the clients no longer train at the same time. Instead, they take turns, with only one client training between any two consecutive communication rounds.
  6. Starting from the second round of communication, both clients' peers have peer.connected set to false.
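
To illustrate why step 3 leads to step 4, here is a minimal sketch, not disco's actual aggregation code. It assumes a client averages its own weights with those of the peers it still sees as connected. The names PeerLike, Weights and averageWithConnectedPeers are hypothetical and only serve the illustration.

```typescript
// Hypothetical stand-in for a SimplePeer instance: only the fields used here.
interface PeerLike {
  id: number
  connected: boolean
}

type Weights = number[]

// Average our own weights with those received from peers that are still connected.
// Returns the averaged weights and how many weight sets went into the average.
function averageWithConnectedPeers(
  own: Weights,
  peers: PeerLike[],
  received: Record<number, Weights | undefined>
): { averaged: Weights; count: number } {
  const contributions: Weights[] = [own]
  for (const peer of peers) {
    const w = received[peer.id]
    // A peer whose connection dropped contributes nothing (assumption of this sketch).
    if (peer.connected && w !== undefined) contributions.push(w)
  }
  const averaged = own.map(
    (_, i) => contributions.reduce((sum, w) => sum + w[i], 0) / contributions.length
  )
  return { averaged, count: contributions.length }
}

const own: Weights = [1, 3]
const received: Record<number, Weights | undefined> = { 7: [3, 5] }

// Peer 7 is connected: two weight sets are averaged.
const healthy = averageWithConnectedPeers(own, [{ id: 7, connected: true }], received)
console.log(`Aggregating a set of ${healthy.count} weights`)

// Peer 7 is seen as disconnected: only our own weights remain.
const dropped = averageWithConnectedPeers(own, [{ id: 7, connected: false }], received)
console.log(`Aggregating a set of ${dropped.count} weights`)
```

Under that assumption, a peer reported as disconnected silently reduces the average to the client's own weights, which would match the "Aggregating a set of 1 weights" message.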

To Reproduce

Run the server tests in branch 9-secAgg-Alyssa-Felix:

#navigate to disco
git fetch
git checkout 9-secAgg-Alyssa-Felix
git pull
cd discojs
npm i --dev
npm run build
cd ../server
npm i --dev
npm run test

Expected behavior

After training begins, we expect:

  • "Is peer <num> connected?" to always be followed by "true".
  • "Aggregating a set of 2 weights" to appear at least once.

Instead, "Is peer <num> connected?" is followed by "false" 5 out of the total 6 times (only counting after training has begun), and we only ever see "Aggregating a set of 1 weights".

aunell avatar Jun 30 '22 11:06 aunell

Updated - we still have an issue, but we have a slightly better idea of when/where it occurs.

Grim-bot avatar Jul 06 '22 17:07 Grim-bot

It looks like only the connection initiators keep seeing their peers as connected after the 'connect' event. The non-initiators lose the connection right after 'connect'.
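
One way to pin down the exact moment the drop happens is to log every state change on both sides. The following is a rough sketch only: 'connect', 'close' and 'error' are simple-peer event names, but ObservablePeer, FakePeer and traceConnection are hypothetical helpers, and FakePeer merely simulates the sequence described above rather than reproducing the bug.

```typescript
// simple-peer events that matter for this diagnosis.
type PeerEvent = 'connect' | 'close' | 'error'

// Hypothetical minimal view of a peer: just what the tracer needs.
interface ObservablePeer {
  connected: boolean
  on(event: PeerEvent, listener: () => void): void
}

// Stand-in peer for the demo (assumption: a real test would pass the SimplePeer instance).
class FakePeer implements ObservablePeer {
  connected = false
  private listeners: Record<PeerEvent, Array<() => void>> = { connect: [], close: [], error: [] }

  on(event: PeerEvent, listener: () => void): void {
    this.listeners[event].push(listener)
  }

  // Update the flag first, then notify listeners.
  emit(event: PeerEvent): void {
    if (event === 'connect') this.connected = true
    if (event === 'close') this.connected = false
    this.listeners[event].forEach((listener) => listener())
  }
}

// Record every state change so the log shows when, and on which side, a peer drops.
function traceConnection(label: string, peer: ObservablePeer, log: string[]): void {
  const events: PeerEvent[] = ['connect', 'close', 'error']
  events.forEach((event) => {
    peer.on(event, () => log.push(`${label}: ${event} (connected=${peer.connected})`))
  })
}

const log: string[] = []
const initiatorSide = new FakePeer() // A's peer for B
const nonInitiatorSide = new FakePeer() // B's peer for A
traceConnection('A->B', initiatorSide, log)
traceConnection('B->A', nonInitiatorSide, log)

// Simulate the reported sequence: both connect, then the non-initiator's peer closes.
initiatorSide.emit('connect')
nonInitiatorSide.emit('connect')
nonInitiatorSide.emit('close')
log.forEach((line) => console.log(line))
```

If the real peers showed a 'close' or 'error' entry on the non-initiator side right after 'connect', that would narrow the problem down to whatever runs between those two events.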

Thity avatar Jul 15 '22 09:07 Thity

Just like #431, @tharvik's implementation of decentralized learning fixed this issue :)

s314cy avatar Oct 25 '22 14:10 s314cy