Loss of user work if a cluster membership change occurs in the middle of an epoch
Description
Currently, when a cluster membership change occurs, the agent kills all the workers running the user's script, performs rank redistribution, and spawns them again. Since there is neither a feedback mechanism nor a communication protocol between the workers and the agent, the user can lose all computational work done since the last checkpoint.
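Until such a protocol exists, the usual mitigation is to checkpoint frequently *within* the epoch, so that a restart loses at most the work of the last few steps rather than the whole epoch. Below is a minimal sketch of that pattern; the paths, the `CHECKPOINT_EVERY` interval, and the `save_checkpoint`/`load_checkpoint` helpers are hypothetical names chosen for illustration, not part of torchelastic's API.

```python
import os
import torch

CKPT_PATH = "/shared/ckpt.pt"   # assumed: a filesystem visible to all nodes
CHECKPOINT_EVERY = 100          # trade-off: checkpoint overhead vs. lost work

def save_checkpoint(model, optimizer, epoch, step):
    # Persist enough state to resume mid-epoch after a restart.
    torch.save(
        {
            "model": model.state_dict(),
            "optim": optimizer.state_dict(),
            "epoch": epoch,
            "step": step,
        },
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    # Called at worker start-up; returns (epoch, step) to resume from.
    if not os.path.exists(CKPT_PATH):
        return 0, 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["epoch"], state["step"]

# Inside the training loop (sketch):
# for step, batch in enumerate(loader, start=start_step):
#     ...
#     if step % CHECKPOINT_EVERY == 0 and rank == 0:
#         save_checkpoint(model, optimizer, epoch, step)
```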
Hi, I'm running distributed training with torchelastic (thanks a lot for the amazing work btw!), and I have very long epochs. So any change in the number of workers (or when using preemptible nodes) results in a large amount of wasted computation since the last checkpoint. Is there any update on this issue? Or any hint at a workaround for now? Would it be possible to detect when a worker group is about to stop?
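One possible way to "detect when a worker group is about to stop" is to catch the termination signal in the worker itself and checkpoint before exiting. This is only a sketch and rests on the assumption that the agent delivers SIGTERM to the worker processes (with some grace period) before forcefully killing them; whether and how that happens may depend on the torchelastic version and configuration, so please verify for your setup. `save_checkpoint` refers to the hypothetical helper from the sketch above.

```python
import signal

class GracefulStop:
    """Set a flag when SIGTERM arrives so the training loop can checkpoint and exit."""

    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGTERM, self._handler)

    def _handler(self, signum, frame):
        # Keep the handler minimal; the training loop does the actual saving.
        self.stop_requested = True

stopper = GracefulStop()

# Inside the training loop (sketch):
# for step, batch in enumerate(loader):
#     ...
#     if stopper.stop_requested:
#         save_checkpoint(model, optimizer, epoch, step)
#         break
```

This does not remove the need for periodic checkpoints (a SIGKILL after the grace period, or a node loss, gives you no chance to react), but it can shrink the window of lost work when the agent restarts the worker group.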