
User loss of work if the cluster change occurs in the middle of the epoch

Open aivanou opened this issue 5 years ago • 1 comment

Description

Currently, when a cluster membership change occurs, the agent kills all the workers that run the user's script, redistributes ranks, and spawns the workers again. Since there is neither a feedback mechanism nor a communication protocol between the workers and the agent, the user can lose all computational work done since the last checkpoint.

aivanou, Apr 28 '20 02:04

Hi, I'm running distributed training with torchelastic (thanks a lot for the amazing work, btw!), and I have very long epochs. So any change in the number of workers (or the use of preemptible nodes) wastes a large amount of computation since the last checkpoint. Is there any update on this issue? Or any hint of a workaround for now? Would it be possible to detect when a worker group is about to stop?

jnkl314, Mar 19 '21 14:03