optimize load_state_from_peers
Problem: if many peers join at once, they will all pick the same averager (the latest one at that time) as the target for loading their initial state. This creates a choke point, as a single averager struggles to serve many newcomers.
Possible solutions:
- [ ] modify the load-state RPC so that each averager only serves T clients at a time; subsequent clients are enqueued and wait (see the donor-side sketch below)
- [x] modify DecentralizedAverager.load_state_from_peers to dynamically switch to an alternative donor averager after the first one turns it away (see the client-side sketch below)
- [ ] [optional] also negotiate bandwidth: donors reveal their free bandwidth and clients pick the best host
- [x] remove split_for_streaming/combine_from_streaming
Also, remove any remaining mentions of tensor chunks from the codebase
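
A minimal donor-side sketch of the first item, assuming an asyncio-based RPC handler. The names here (`ThrottledStateDonor`, `rpc_download_state`, `MAX_CONCURRENT_DOWNLOADS`) are illustrative, not the actual hivemind API. It caps concurrent state downloads at T and turns extra clients away with an empty stream so they can fall back to another donor; alternatively, extra clients could be queued on the semaphore, as the checklist item suggests:

```python
import asyncio
from typing import AsyncIterator, Callable

MAX_CONCURRENT_DOWNLOADS = 4  # the "T" from the checklist; the value is illustrative


class ThrottledStateDonor:
    """Hypothetical donor-side limiter: serve at most T state downloads at once
    and turn away everyone else, so newcomers try another donor instead of
    piling up on this one."""

    def __init__(self, get_state_chunks: Callable[[], AsyncIterator[bytes]],
                 max_clients: int = MAX_CONCURRENT_DOWNLOADS):
        self._get_state_chunks = get_state_chunks
        self._slots = asyncio.Semaphore(max_clients)

    async def rpc_download_state(self) -> AsyncIterator[bytes]:
        if self._slots.locked():       # all T slots are busy
            return                     # empty stream tells the client to go elsewhere
        async with self._slots:
            async for chunk in self._get_state_chunks():
                yield chunk            # stream the serialized state to this client
```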
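And a client-side sketch of the second item, kept deliberately framework-free: `donors` and `request_state` are stand-ins (not real hivemind objects) for the list of candidate donor averagers and whatever call actually downloads state from one of them. The point is simply to fall through to the next donor whenever the current one is busy or unreachable:

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def load_state_from_peers(
    donors: Sequence[str],
    request_state: Callable[[str, float], Optional[dict]],
    per_donor_timeout: float = 30.0,
) -> Optional[dict]:
    """Hypothetical client loop: walk through candidate donors and move on
    as soon as one refuses, times out, or fails, instead of waiting on the
    single most recent averager."""
    for donor in donors:  # donors are assumed pre-sorted, freshest state first
        try:
            state = request_state(donor, per_donor_timeout)
        except Exception as exc:  # donor unreachable or mid-shutdown
            logger.debug("Donor %s failed: %s", donor, exc)
            continue
        if state is None:  # donor explicitly sent us away (over its T-client limit)
            logger.debug("Donor %s is busy, trying the next one", donor)
            continue
        return state
    return None  # every donor was busy or unreachable; the caller may retry later
```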