
[BUG][MINOR] Downloading state during averaging (and vice versa)

Open justheuristic opened this issue 3 years ago • 1 comment

(reported by CALM volunteers)

Describe the bug

This happens to a new peer that joins training while others are averaging parameters. Since all peers are averaging parameters, the newcomer will be stuck in the following loop:

  • newcomer requests state from a random peer
  • that peer is busy averaging parameters and will get stuck at this line: averager.py:658 (the lock for get_tensors is blocked by state_averager.step)
  • newcomer gets TimeoutError because target did not respond within next_chunk_timeout
  • newcomer prints an error message and tries again with a new peer, which is also busy

This repeats roughly `floor(averaging_time / next_chunk_timeout)` times until state averaging is done (for example, with averaging_time ≈ 60 s and next_chunk_timeout = 5 s, that is about 12 failed attempts); then it proceeds normally. In the worst case, if the newcomer tries once for every other peer, it will skip the initial load_state_from_peers. However, it will still detect that it is out of sync and retry. A toy sketch of this contention pattern is shown below.

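A minimal, self-contained toy model of this contention in plain Python threading (the averaging_time and next_chunk_timeout values below are illustrative, not hivemind defaults):

```python
import math
import threading
import time

averaging_time = 3.0      # how long "state_averager.step" holds the lock (illustrative)
next_chunk_timeout = 0.5  # per-attempt timeout on the newcomer side (illustrative)

state_lock = threading.Lock()

def averaging_peer():
    # simulates a peer whose get_tensors lock is held by state_averager.step
    with state_lock:
        time.sleep(averaging_time)

threading.Thread(target=averaging_peer, daemon=True).start()
time.sleep(0.1)  # let the "averaging" peer grab the lock first

attempts = 0
while True:
    attempts += 1
    # the newcomer gives up on this peer after next_chunk_timeout seconds
    if state_lock.acquire(timeout=next_chunk_timeout):
        state_lock.release()
        break
    print(f"attempt {attempts}: peer busy averaging -> TimeoutError, retrying")

expected_failures = math.floor(averaging_time / next_chunk_timeout)
print(f"succeeded on attempt {attempts} (expected about {expected_failures} failed attempts)")
```
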
To Reproduce

  • a newcomer joins while state_averager.step on the existing peers takes more time than next_chunk_timeout (see the sketch below)

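For reference, a hedged sketch of the newcomer side using the public quickstart-style API (the initial_peers multiaddr, run_id, model, and batch sizes are placeholders; exact keyword arguments may differ between hivemind versions):

```python
import torch
import hivemind

# placeholder multiaddr of an existing peer in the run (not a real address)
dht = hivemind.DHT(initial_peers=["/ip4/1.2.3.4/tcp/1337/p2p/PEER_ID"], start=True)

model = torch.nn.Linear(16, 16)
opt = hivemind.Optimizer(
    dht=dht,
    run_id="placeholder_run",  # must match the run the existing peers use
    batch_size_per_step=32,
    target_batch_size=10000,
    optimizer=torch.optim.SGD(model.parameters(), lr=0.01),
)

# If every peer contacted here is inside state_averager.step and that step
# outlasts next_chunk_timeout, this call keeps hitting TimeoutError and
# retrying, as described above, until the averaging round finishes.
opt.load_state_from_peers()
```
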
This is how it looks from the user's perspective: [screenshot]

This is how it looks on an auxiliary peer: [screenshot]

Environment

This behavior is an algorithmic side effect of how the averager is implemented in hivemind. It should not depend on Python/PyTorch versions.

  • python version: 3.7 (or any other)
  • hivemind.version: master (1.1.0.dev0)
  • pytorch version: 1.10 (numpy version irrelevant)

Possible solutions (non-exhaustive)

  • newcomer: somehow detect that state averaging is in progress and wait for up to averaging_timeout seconds?
  • add an option to not acquire the lock during load_state_from_peers (this works fine now, but may be unsafe for some optimizers / averagers) -- see the toy sketch after this list

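As a rough illustration of the second idea (toy code, not hivemind internals; ToyStateServer, get_state, and require_lock are made-up names), a state request could fall back to a cached snapshot instead of blocking on the averaging lock:

```python
import copy
import threading

class ToyStateServer:
    def __init__(self, tensors):
        self.lock = threading.Lock()      # stands in for the get_tensors lock
        self.tensors = tensors
        self.snapshot = copy.deepcopy(tensors)

    def get_state(self, require_lock=True):
        if require_lock:
            # current behavior: blocks while "averaging" holds the lock
            with self.lock:
                return copy.deepcopy(self.tensors)
        # proposed option: refresh the snapshot only if the lock is free,
        # otherwise serve the (possibly slightly stale) cached copy
        if self.lock.acquire(blocking=False):
            try:
                self.snapshot = copy.deepcopy(self.tensors)
            finally:
                self.lock.release()
        return copy.deepcopy(self.snapshot)
```

Whether serving a slightly stale snapshot mid-averaging is safe depends on the optimizer / averager, as noted in the second bullet above.
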
justheuristic · Jan 14 '22

I'm encountering this issue; are there any workarounds?

blurry-mood · Jan 04 '23