
possible deadlock in dataloader

Open nankepan opened this issue 2 years ago • 3 comments

Hi, when I train a model with num_workers > 1, it sometimes gets stuck on this line:
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/test_DAN.py#L137
Debugging shows it hangs on these two lines:
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/libs/dataset/YoutubeVOS.py#L156
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/libs/dataset/YoutubeVOS.py#L157
With num_workers=0 training runs normally, but it is very slow.

The problem looks similar to https://github.com/pytorch/pytorch/issues/1355, but I could not fix it with the methods suggested in that issue.
How can I fix this?
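For reference, the mitigation most often suggested in that thread is to disable OpenCV's internal threading inside each DataLoader worker. Below is a minimal sketch of what that looks like, assuming the dataset's image loading goes through cv2 (I have not confirmed this for DANet's YoutubeVOS dataset; the dummy dataset is only for illustration):

```python
import cv2
import torch
from torch.utils.data import DataLoader, Dataset

class DummySet(Dataset):
    """Stand-in for the YouTubeVOS dataset, just for illustration."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return torch.zeros(3, 64, 64)

def worker_init_fn(worker_id):
    # OpenCV keeps an internal thread pool that can deadlock after fork,
    # so turn it off inside every DataLoader worker process.
    cv2.setNumThreads(0)

if __name__ == '__main__':
    loader = DataLoader(DummySet(), batch_size=4, num_workers=4,
                        worker_init_fn=worker_init_fn)
    for batch in loader:
        pass  # one pass to exercise the workers
```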

nankepan avatar May 16 '22 09:05 nankepan

Hi, if you want to train the model, you should use train_DAN.py. The default num_workers for training is 4:
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/train_DAN.py#L46
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/train_DAN.py#L82
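Paraphrasing the two linked lines for readers without the repo open (this sketch is illustrative, not verbatim from the source; the flag name and dataset are assumptions):

```python
import argparse
import torch
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
# ~train_DAN.py L46: default worker count for training
parser.add_argument('--num_workers', type=int, default=4)
args = parser.parse_args([])

# ~train_DAN.py L82: the training loader is built with that worker count
train_dataset = TensorDataset(torch.zeros(8, 3, 64, 64))  # stand-in dataset
train_loader = DataLoader(train_dataset, batch_size=4,
                          shuffle=True, num_workers=args.num_workers)
```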

scutpaul avatar May 16 '22 09:05 scutpaul

I did use train_DAN.py with num_workers=4. It still sometimes gets stuck.
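One more suggestion from the PyTorch thread that might be worth checking here is forcing the 'spawn' start method before building the loader (a sketch; I cannot confirm it fixes this particular hang):

```python
import torch.multiprocessing as mp

if __name__ == '__main__':
    # 'spawn' sidesteps fork-related deadlocks with threaded native libraries,
    # at the cost of slower worker startup.
    mp.set_start_method('spawn', force=True)
    # ... then build the DataLoader and start training as usual ...
```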

nankepan avatar May 16 '22 10:05 nankepan

Hi, you can download our conda YAML to create the Python environment: FSVOS.yaml.zip
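To create the environment from it, unzip the archive and run `conda env create -f FSVOS.yaml`, then activate the environment named inside the YAML.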

scutpaul avatar May 16 '22 10:05 scutpaul