hivemind
[BUG][MINOR] Downloading state during averaging (and vice versa)
(reported by CALM volunteers)
Describe the bug
This happens to a new peer that joins training while others are averaging parameters. Since all existing peers are busy averaging, the newcomer gets stuck in the following loop:
- the newcomer requests state from a random peer
- that peer is busy averaging parameters and gets stuck at averager.py:658 (the lock for `get_tensors` is held by `state_averager.step`)
- the newcomer gets a TimeoutError because the target peer did not respond within `next_chunk_timeout`
- the newcomer prints an error message and tries again with another peer, which is also busy
This repeats roughly `floor(averaging_time / next_chunk_timeout)` times until state averaging is done, after which the download proceeds normally. In the worst case, if the newcomer tries once for every other peer, it will skip the initial `load_state_from_peers`. However, it will still detect being out of sync and retry.
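For intuition, here is a toy asyncio sketch of that interaction (this is not hivemind code, just an illustration with made-up constants): one coroutine stands in for `state_averager.step` holding the tensor lock, another stands in for the `get_tensors`-backed state-serving handler, and the newcomer keeps timing out and retrying until the lock is released:

```python
import asyncio

NEXT_CHUNK_TIMEOUT = 1.0   # stand-in for next_chunk_timeout
AVERAGING_TIME = 5.0       # stand-in for how long state_averager.step holds the lock

async def averaging_step(tensor_lock):
    # the busy peer: holds the tensor lock for the entire averaging step
    async with tensor_lock:
        await asyncio.sleep(AVERAGING_TIME)

async def serve_state(tensor_lock):
    # the same peer's state-serving handler: cannot reply until the lock is free
    async with tensor_lock:
        return "state chunk"

async def newcomer(tensor_lock):
    attempts = 0
    while True:
        attempts += 1
        try:
            state = await asyncio.wait_for(serve_state(tensor_lock), timeout=NEXT_CHUNK_TIMEOUT)
            print(f"attempt {attempts}: received {state!r}")
            return
        except asyncio.TimeoutError:
            print(f"attempt {attempts}: peer did not respond in time, retrying...")

async def main():
    tensor_lock = asyncio.Lock()
    await asyncio.gather(averaging_step(tensor_lock), newcomer(tensor_lock))

asyncio.run(main())
```

Running this prints roughly `floor(AVERAGING_TIME / NEXT_CHUNK_TIMEOUT)` failed attempts before the state is finally served, mirroring the retry loop described above.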
To Reproduce
- `state_averager.step` takes more time than `next_chunk_timeout`
This is how it looks from the user's perspective:
This is how it looks on an auxiliary peer:
Environment
This behavior is an algorithmic side effect of how the averager is implemented in hivemind; it should not depend on the Python/PyTorch version.
- Python version: 3.7 (or any other)
- hivemind version: master (1.1.0.dev0)
- PyTorch version: 1.10; numpy version is irrelevant
Possible solutions (non-exhaustive)
- newcomer: somehow detect when state averaging is in progress and wait for up to `averaging_timeout` seconds? (a rough client-side sketch of this idea is shown after this list)
- add an option to not acquire the lock during `load_state_from_peers` (this works fine now, but may be unsafe for some optimizers / averagers)
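In the spirit of the first bullet, a possible client-side workaround is to keep retrying `load_state_from_peers` for up to `averaging_timeout` seconds, so the newcomer simply outlasts the averaging round instead of giving up after one pass over the peers. This is only a sketch under assumptions: the constants are placeholders, and it presumes a failed pass raises an exception; if your hivemind version just logs a warning and returns without applying state, the retry condition should check the return value instead.

```python
import time

AVERAGING_TIMEOUT = 60.0  # placeholder: upper bound on how long an averaging round takes
RETRY_INTERVAL = 5.0      # placeholder: pause between full load_state_from_peers passes

def load_state_with_patience(averager):
    """Retry load_state_from_peers until it succeeds or the deadline expires."""
    deadline = time.monotonic() + AVERAGING_TIMEOUT
    while True:
        try:
            # one pass: tries each known peer once (as described above)
            return averager.load_state_from_peers()
        except Exception:
            # every peer appears to be busy averaging (or otherwise failed to respond)
            if time.monotonic() >= deadline:
                raise
            time.sleep(RETRY_INTERVAL)
```

On the newcomer, `load_state_with_patience(averager)` (a hypothetical helper, not part of hivemind) would replace a direct `load_state_from_peers()` call.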
I'm encountering this issue; are there any workarounds?