Why is time.sleep(2) needed in EpochBasedRunner? When will the deadlock happen?

Open Bilibilee opened this issue 2 years ago • 4 comments

Why is time.sleep(2) needed here: https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/epoch_based_runner.py#L46

The comment on that line says: Prevent possible deadlock during epoch transition.

When can this deadlock happen? More importantly, time.sleep(2) is neither elegant nor flexible, and it wastes time on small datasets. I think a redesign is necessary.
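For a rough sense of the cost on small datasets, here is a minimal sketch. It assumes exactly one fixed 2-second pause per epoch and nothing else; the function name and the epoch timings are made up for illustration and are not measurements from mmcv.

```python
def sleep_overhead_fraction(epochs, seconds_per_epoch, sleep_seconds=2.0):
    """Fraction of total wall-clock time spent in a fixed per-epoch sleep.

    Assumption: one pause of `sleep_seconds` per epoch (hypothetical model,
    not taken from the mmcv source).
    """
    total_sleep = epochs * sleep_seconds
    total_work = epochs * seconds_per_epoch
    return total_sleep / (total_sleep + total_work)


# Hypothetical numbers: a tiny dataset (3 s per epoch) vs. a large one (600 s per epoch).
small = sleep_overhead_fraction(epochs=300, seconds_per_epoch=3.0)
large = sleep_overhead_fraction(epochs=300, seconds_per_epoch=600.0)
print(f"small dataset: {small:.1%} of wall-clock time is sleep")  # 40.0%
print(f"large dataset: {large:.1%} of wall-clock time is sleep")  # 0.3%
```

Under these assumed numbers the pause is negligible for long epochs but takes a large share of the run when each epoch lasts only a few seconds.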

Bilibilee avatar Jan 03 '22 06:01 Bilibilee

Hi, this is a workaround for a possible deadlock in the dataloader. More details can be found at https://github.com/pytorch/pytorch/issues/1355#issuecomment-517955232. We will look for a more elegant way to resolve the problem.

zhouzaida avatar Jan 04 '22 12:01 zhouzaida

Hi @zhouzaida, can the deadlock also occur with single-GPU training? I'm wondering about the root cause that led to inserting time.sleep(2) in mmcv.

JihwanEom avatar Jul 08 '22 08:07 JihwanEom

Hi @JihwanEom, in most cases the deadlock will not happen, so you can remove this line from your local mmcv, which will speed up your training.
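As an alternative to deleting the line outright, a local copy could make the pause adjustable. The sketch below is a toy stand-in for an epoch loop, not the mmcv runner: `run_epochs`, `epoch_pause`, and `step` are hypothetical names, and a plain list stands in for the dataloader.

```python
import time


def run_epochs(batches, max_epochs, epoch_pause=2.0, step=lambda batch: None):
    """Toy epoch loop; `epoch_pause` stands in for a hard-coded time.sleep(2).

    Hypothetical helper for illustration only, not part of mmcv.
    """
    steps = 0
    for _ in range(max_epochs):
        if epoch_pause > 0:
            # Pause before iterating the loader again (the epoch transition).
            time.sleep(epoch_pause)
        for batch in batches:
            step(batch)
            steps += 1
    return steps


# With the pause set to 0, the loop runs without any per-epoch delay.
start = time.perf_counter()
n = run_epochs(batches=[0, 1, 2], max_epochs=3, epoch_pause=0.0)
elapsed = time.perf_counter() - start
print(f"{n} steps in {elapsed:.3f}s")
```

Keeping the pause as a parameter makes it easy to restore if a hang ever shows up at an epoch boundary.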

zhouzaida avatar Jul 08 '22 12:07 zhouzaida

Okay, I got it. But could you explain in what situation the deadlock is expected to occur? I'd like to address this by analyzing the root cause. I can't see when removing time.sleep(2) would risk a deadlock.

JihwanEom avatar Jul 08 '22 13:07 JihwanEom

Hi @zhouzaida. Like @JihwanEom, I wonder in which cases the deadlock can occur once time.sleep(2) is removed. I ran multi-GPU training with time.sleep(2) disabled and it worked fine; I never hit a deadlock. I'm trying to get the most out of multi-GPU training, but the sleep(2) looks like a bottleneck for overall training time. Could you please describe a specific scenario (such as a dataset or pipeline) in which a deadlock might happen if the sleep is removed, and how likely it is? I also wonder why this exact line, time.sleep(2), resolves the problem.

supersoob avatar Nov 08 '22 05:11 supersoob