mmcv
Why is time.sleep(2) needed in EpochBasedRunner? When can the deadlock happen?
Why is time.sleep(2) needed in https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/epoch_based_runner.py#L46?
The comment there says: Prevent possible deadlock during epoch transition.
When can this deadlock happen? More importantly, time.sleep(2) is neither elegant nor flexible, and it wastes time on small datasets. I think it needs to be redesigned.
Hi, this is a workaround for a possible deadlock in the dataloader. More details can be found at https://github.com/pytorch/pytorch/issues/1355#issuecomment-517955232. We will look for a more elegant way to resolve the problem.
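For readers following along, here is a minimal sketch of the pattern being discussed: pausing at each epoch transition before the loader is iterated again. It is not mmcv's code. `run_epochs`, `train_step`, and the `transition_delay` parameter are hypothetical names introduced only for illustration, and the loader is any plain iterable rather than a PyTorch DataLoader.

```python
import time


def run_epochs(data_loader, num_epochs, train_step, transition_delay=2.0):
    """Run a simple epoch loop, pausing before each epoch is iterated.

    `transition_delay` is a hypothetical knob (not an mmcv option) that
    stands in for the hard-coded time.sleep(2) discussed in this thread.
    """
    batches_seen = 0
    for _ in range(num_epochs):
        if transition_delay > 0:
            # Wait before a fresh iterator over the loader is created.
            time.sleep(transition_delay)
        for batch in data_loader:
            train_step(batch)
            batches_seen += 1
    return batches_seen
```

Calling `run_epochs(loader, 10, step, transition_delay=0.0)` skips the pause entirely, which is the behaviour you get by deleting the line locally.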
Hi @zhouzaida, can the deadlock also occur with single-GPU training? I'm wondering about the root cause that led to inserting time.sleep(2) in mmcv.
Hi @JihwanEom, in most cases the deadlock will not happen, so this line can be removed from your local mmcv, which will speed up your training.
Okay, I got it. But could you explain the situation in which the deadlock is expected? I want to resolve this by analyzing the root cause. I can't see when removing time.sleep(2) would risk a deadlock.
Hi @zhouzaida. Like @JihwanEom, I wonder in which cases the deadlock can occur once time.sleep(2) is removed. Even when I ran multi-GPU training with time.sleep(2) disabled, I didn't hit any deadlock and training worked well. I'm trying to benefit from multi-GPU training, but sleep(2) seems to be a bottleneck for overall training time. Could you please explain the specific scenario (e.g. dataset or pipeline) in which the deadlock may happen if the sleep is removed? I also wonder why this exact line, time.sleep(2), resolves it.
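To put a number on the overhead raised in this thread, here is a rough back-of-the-envelope sketch. It assumes exactly one fixed sleep per epoch and a constant epoch duration. Both are simplifications, and `sleep_overhead` is a hypothetical helper, not something mmcv provides.

```python
def sleep_overhead(num_epochs, epoch_seconds, delay_seconds=2.0):
    """Return (total seconds slept, fraction of wall time spent sleeping).

    Assumes one sleep of `delay_seconds` per epoch and that every epoch
    takes `epoch_seconds` of actual work.
    """
    total_sleep = num_epochs * delay_seconds
    total_time = num_epochs * (epoch_seconds + delay_seconds)
    return total_sleep, total_sleep / total_time
```

Under these assumptions, 100 epochs of 8 seconds each spend 200 seconds sleeping, about 20% of the wall time, while 100 epochs of 10 minutes each spend well under 1%. That is why the sleep is most noticeable on small datasets.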