optimize load_state_from_peers
Problem: if many peers join at once, they will all pick the same averager (the latest one at that time) as the target for loading their initial state. This creates a choke point, as a single averager struggles to serve many newcomers.
Possible solutions:
- [ ] modify the load-state RPC so that each averager only serves T clients at a time; subsequent clients are enqueued and wait (see the donor-side sketch below)
- [x] modify DecentralizedAverager.load_state_from_peers to dynamically switch to an alternative donor averager after the first one turns it away (see the client-side sketch below)
- [ ] [optional] also negotiate bandwidth: donors reveal their free bandwidth and clients pick the best host
- [x] remove split_for_streaming/combine_from_streaming
Also, remove any remaining mentions of tensor chunks from the codebase
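
A minimal donor-side sketch of the first item, assuming an asyncio-based RPC handler. The names here (`ThrottledStateDonor`, `rpc_download_state`, `MAX_CONCURRENT_DOWNLOADS`) are illustrative, not the actual hivemind API. It caps concurrent state downloads at T and turns extra clients away with an empty stream so they can fall back to another donor; alternatively, extra clients could be queued on the semaphore, as the checklist item suggests:

```python
import asyncio
from typing import AsyncIterator, Callable

MAX_CONCURRENT_DOWNLOADS = 4  # the "T" from the checklist; the value is illustrative


class ThrottledStateDonor:
    """Hypothetical donor-side limiter: serve at most T state downloads at once
    and turn away everyone else, so newcomers try another donor instead of
    piling up on this one."""

    def __init__(self, get_state_chunks: Callable[[], AsyncIterator[bytes]],
                 max_clients: int = MAX_CONCURRENT_DOWNLOADS):
        self._get_state_chunks = get_state_chunks
        self._slots = asyncio.Semaphore(max_clients)

    async def rpc_download_state(self) -> AsyncIterator[bytes]:
        if self._slots.locked():       # all T slots are busy
            return                     # empty stream tells the client to go elsewhere
        async with self._slots:
            async for chunk in self._get_state_chunks():
                yield chunk            # stream the serialized state to this client
```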
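And a client-side sketch of the second item, kept deliberately framework-free: `donors` and `request_state` are stand-ins (not real hivemind objects) for the list of candidate donor averagers and whatever call actually downloads state from one of them. The point is simply to fall through to the next donor whenever the current one is busy or unreachable:

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def load_state_from_peers(
    donors: Sequence[str],
    request_state: Callable[[str, float], Optional[dict]],
    per_donor_timeout: float = 30.0,
) -> Optional[dict]:
    """Hypothetical client loop: walk through candidate donors and move on
    as soon as one refuses, times out, or fails, instead of waiting on the
    single most recent averager."""
    for donor in donors:  # donors are assumed pre-sorted, freshest state first
        try:
            state = request_state(donor, per_donor_timeout)
        except Exception as exc:  # donor unreachable or mid-shutdown
            logger.debug("Donor %s failed: %s", donor, exc)
            continue
        if state is None:  # donor explicitly sent us away (over its T-client limit)
            logger.debug("Donor %s is busy, trying the next one", donor)
            continue
        return state
    return None  # every donor was busy or unreachable; the caller may retry later
```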