hivemind

optimize load_state_from_peers

Open • justheuristic opened this issue 4 years ago • 1 comment

problem: if many peers join at once, they all pick the same averager (the latest one at the time) as the target for loading their initial state. This creates a choke point, as that one averager struggles to serve many newcomers.

possible solution:

  • [ ] modify the load-state RPC so that each averager serves at most T clients at a time. All subsequent clients are enqueued and wait.
  • [x] modify DecentralizedAverager.load_state_from_peers to switch dynamically to alternative donor averagers after the first one turns the client away
  • [ ] [optional] also negotiate bandwidth: donors report their free bandwidth and clients pick the best host
  • [x] remove split_for_streaming/combine_from_streaming
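
To make the first two items concrete, here is a minimal sketch of the idea in plain asyncio. It is not hivemind code: `Donor`, `DonorBusy`, `rpc_load_state`, and `load_state_from_donors` are hypothetical names, and it assumes a donor at capacity turns the client away (rather than queueing it) so that the client can fall back to the next donor.

```python
import asyncio


class DonorBusy(Exception):
    """Raised by a donor that is already serving its maximum number of clients."""


class Donor:
    """Hypothetical stand-in for an averager that serves its state to newcomers."""

    def __init__(self, name, state, max_clients):
        self.name = name
        self.state = state
        self.max_clients = max_clients  # the "T" from the checklist
        self.active = 0  # number of transfers currently in progress

    async def rpc_load_state(self):
        # Assumption: a full donor rejects the client so it can try another donor.
        if self.active >= self.max_clients:
            raise DonorBusy(self.name)
        self.active += 1
        try:
            await asyncio.sleep(0.01)  # simulate the state transfer
            return self.state
        finally:
            self.active -= 1


async def load_state_from_donors(donors):
    """Try donors in priority order and move on whenever one is at capacity."""
    for donor in donors:
        try:
            return donor.name, await donor.rpc_load_state()
        except DonorBusy:
            continue
    return None  # every donor was busy; the caller may retry later


async def demo():
    donors = [
        Donor("latest", {"step": 10}, max_clients=2),
        Donor("older", {"step": 10}, max_clients=2),
    ]
    # Four newcomers arrive at once: two fit on "latest", two fall back to "older".
    results = await asyncio.gather(*(load_state_from_donors(donors) for _ in range(4)))
    return [name for name, _ in results]
```

Running `asyncio.run(demo())` returns `["latest", "latest", "older", "older"]`: the load spreads across donors instead of piling onto the most recent one. A queueing variant, as the first item describes, could replace the rejection with an `asyncio.Semaphore(max_clients)` that clients wait on.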

justheuristic • Mar 03 '21 23:03

Also, remove any remaining mentions of tensor chunks from the codebase

mryab • Jun 28 '21 23:06