disco
Decentralized Learning Implementation Fails to Add New Peers
EDIT:
The connection problem occurs at a later stage and is slightly different from what we thought. More refined tests in `/server/tests/end_to_end_decentralized.ts` show that:
- After 2 `insecureDecentralized` clients (A and B) connect to the same task, a 2-way peer-to-peer connection between them is established (presumably successfully). At least, they each have a `SimplePeer` instance (called `peer`) which corresponds to the other client, and `peer.connected` is `true` for both of them.
- Up to the first round of communication, both clients train in parallel. Each trainerlogger message is printed twice in a row, with the exact same accuracy and loss (each client prints it once, and both are doing the exact same thing).
- However, during the first round of communication, one of the clients' peers has `peer.connected` set to `false`! Specifically, if client A initiated the connection, then client B's peer for A has `peer.connected` set to `false`.
- Consequently, no averaging of weights takes place.
- Subsequently, the clients no longer train at the same time. Instead, they take turns, with only one client training between any two consecutive communication rounds.
- Starting from the second round of communication, both clients' peers have `peer.connected` set to `false`.
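The "no averaging" symptom in the list above can be illustrated with a small sketch. It assumes, hypothetically, that aggregation only uses weights from peers whose `connected` flag is `true`. `PeerLike`, `collectWeights` and `averageWeights` are illustrative names, not disco's actual API, and the numbers are made up.

```typescript
// Hypothetical stand-in for a SimplePeer instance: only the `connected`
// flag and the last weights received from that peer are modelled.
interface PeerLike {
  id: number
  connected: boolean
  lastWeights?: number[]
}

// Collect the local weights plus those of every peer still reported as connected.
function collectWeights(local: number[], peers: PeerLike[]): number[][] {
  const received = peers
    .filter((p) => p.connected && p.lastWeights !== undefined)
    .map((p) => p.lastWeights as number[])
  return [local, ...received]
}

// Element-wise mean of a non-empty set of equally sized weight vectors.
function averageWeights(set: number[][]): number[] {
  if (set.length === 0) throw new Error('cannot average an empty set')
  const size = set[0].length
  const sum = new Array<number>(size).fill(0)
  for (const w of set) {
    if (w.length !== size) throw new Error('weight vectors differ in length')
    w.forEach((v, i) => { sum[i] += v })
  }
  return sum.map((v) => v / set.length)
}

// Client B's view of A once the connection is reported as dropped (made-up values).
const peersSeenByB: PeerLike[] = [{ id: 0, connected: false, lastWeights: [1, 3] }]
const setForB = collectWeights([3, 5], peersSeenByB) // contains only B's own weights
const averagedForB = averageWeights(setForB)         // identical to B's local weights
```

Under this assumption, a peer flagged as disconnected silently shrinks the set to a single entry, which would match a log line reporting a set of 1 weights.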
To Reproduce
Run the server tests in branch `9-secAgg-Alyssa-Felix`:
#navigate to disco
git fetch
git checkout 9-secAgg-Alyssa-Felix
git pull
cd discojs
npm i --dev
npm run build
cd ../server
npm i --dev
npm run test
Expected behavior
After training begins, we expect:
- `Is peer <num> connected?` always followed by `true`.
- At least once, `Aggregating a set of 2 weights.`
Instead, `Is peer <num> connected?` is followed by `false` 5 out of the total 6 times (only counting after training has begun), and we only ever see `Aggregating a set of 1 weights.`
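To check the expectation above against captured test output, a rough log summarizer could look like the following sketch. It assumes that each `Is peer <num> connected?` line is answered by `true` or `false` on the next line. `summarizeLog` and `meetsExpectation` are hypothetical helpers, not part of disco.

```typescript
interface LogSummary {
  connectedTrue: number
  connectedFalse: number
  aggregateSizes: number[]
}

// Assumption: the answer to "Is peer <num> connected?" is on the following line.
function summarizeLog(lines: string[]): LogSummary {
  const summary: LogSummary = { connectedTrue: 0, connectedFalse: 0, aggregateSizes: [] }
  const question = /Is peer \d+ connected\?/
  const aggregate = /Aggregating a set of (\d+) weights/
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i]
    if (question.test(line)) {
      const answer = (lines[i + 1] ?? '').trim()
      if (answer === 'true') summary.connectedTrue++
      else if (answer === 'false') summary.connectedFalse++
    }
    const match = aggregate.exec(line)
    if (match !== null) summary.aggregateSizes.push(Number(match[1]))
  }
  return summary
}

// Expected behavior: never `false`, and at least one aggregation over 2 weights.
function meetsExpectation(s: LogSummary): boolean {
  return s.connectedFalse === 0 && s.aggregateSizes.some((n) => n === 2)
}
```

Feeding it only the lines printed after training has begun keeps the counts comparable with the 5-out-of-6 figure reported above.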
Update: we still have an issue, but we have a slightly better idea of when and where it occurs.
It looks like only the connection initiators keep seeing their peers as connected after the 'on connect' event. The non-initiators lose the connection right after 'on connect'.
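One way to narrow down when the non-initiator loses the connection is to record each peer's lifecycle events. SimplePeer emits `connect`, `close` and `error` events. The sketch below uses a minimal `PeerEvents` interface and a `FakePeer` class (both hypothetical, for illustration only) so that it runs without a WebRTC stack. A real `SimplePeer` instance is assumed to be structurally compatible with `PeerEvents`, which is untested here.

```typescript
// Minimal event-source shape; SimplePeer's `on` is assumed to be compatible.
interface PeerEvents {
  on(event: string, listener: (...args: unknown[]) => void): unknown
}

interface TraceEntry {
  label: string
  event: string
  at: number
}

// Record every 'connect', 'close' and 'error' event of a peer in `trace`.
function tracePeer(peer: PeerEvents, label: string, trace: TraceEntry[]): void {
  for (const event of ['connect', 'close', 'error']) {
    peer.on(event, () => { trace.push({ label, event, at: Date.now() }) })
  }
}

// Tiny fake emitter used only to exercise the sketch.
class FakePeer implements PeerEvents {
  private readonly listeners = new Map<string, Array<(...args: unknown[]) => void>>()

  on(event: string, listener: (...args: unknown[]) => void): this {
    const list = this.listeners.get(event) ?? []
    list.push(listener)
    this.listeners.set(event, list)
    return this
  }

  emit(event: string, ...args: unknown[]): void {
    for (const l of this.listeners.get(event) ?? []) l(...args)
  }
}

const trace: TraceEntry[] = []
const fakeB = new FakePeer()
tracePeer(fakeB, "B's peer for A", trace)
fakeB.emit('connect')
fakeB.emit('close')
// `trace` now holds a 'connect' entry followed by a 'close' entry.
```

Comparing the traces of the initiator and the non-initiator should show whether a `close` or `error` event fires on one side right after `connect`.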
Just like #431, @tharvik's implementation of decentralized learning fixed this issue :)