Prefetcher shutdown hang when running with multiprocess + distributed reading service
🐛 Describe the bug
Prefetcher hangs indefinitely on shutdown(). The faulthandler stack traces indicate that the main thread is blocked at https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/util/prefetcher.py#L113 while the child thread is blocked at https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/util/prefetcher.py#L81, but I don't understand why time.sleep would block on exit.
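For reference, the traces above came from Python's faulthandler; a minimal sketch of one way to capture such per-thread dumps while the process is stuck (the signal choice and timeout are assumptions, not what was necessarily used here):

```python
import faulthandler
import signal

# Dump all thread stacks when the process receives SIGUSR1
# (trigger with `kill -USR1 <pid>` while shutdown() is hanging).
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or dump every 60 seconds automatically until the process exits.
faulthandler.dump_traceback_later(60, repeat=True)
```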
Repro:
@functional_datapipe("frame_slicer")
class FrameSlicer(IterDataPipe):
def __init__(self, source_datapipe) -> None:
self.source_datapipe = source_datapipe
def __iter__(self):
for fields in self.source_datapipe:
video_id, seg_start, seg_end = fields
for i in range(int(seg_start), int(seg_end)+1):
yield (video_id, i)
def generate_entries():
lines = []
# start with a prime number to make sure we have uneven dataloaders
random.seed(10)
for i in range(37):
frame_count = random.randint(5, 10)
lines.append([f'video-{i}', 10, 10 + frame_count])
return lines
def build_one_datapipe():
entries = generate_entries()
total_frames = sum([x[2] - x[1] + 1 for x in entries])
dp = IterableWrapper(entries)
dp = dp.shuffle()
dp = dp.sharding_filter()
dp = dp.frame_slicer()
return dp, total_frames
def build_dataloader2():
dp, total_frames = build_one_datapipe()
mp_rs = MultiProcessingReadingService(num_workers=2)
dist_rs = DistributedReadingService()
rs = SequentialReadingService(dist_rs, mp_rs)
dl = DataLoader2(dp, reading_service=rs)
dl.seed(2)
counter = 0
video_ids = set()
for data in dl:
video_ids.add(data[0])
counter += 1
dl.shutdown() # hang here
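Not part of the original report, but for completeness: DistributedReadingService expects an initialized process group, so the repro presumably runs under a distributed launcher. A minimal single-node launch sketch (backend and launcher choice are assumptions):

```python
import torch.distributed as dist

if __name__ == "__main__":
    # e.g. `torchrun --nproc_per_node=1 repro.py`, using env:// rendezvous
    dist.init_process_group(backend="gloo")
    build_dataloader2()
    dist.destroy_process_group()
```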
Versions
PyTorch version: 2.0.0a0+gite9ebda2
Is debug build: False
CUDA used to build PyTorch: 12.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 12.0.1 (https://github.com/conda-forge/clangdev-feedstock d44358f44aef33e9fa7c5f93e2481ee8f1a04ab6)
CMake version: version 3.19.1
Libc version: glibc-2.31
Python version: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-64-generic-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: 12.0.140
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] mypy-protobuf==3.3.0
[pip3] numpy==1.23.5
[pip3] pytorch3d==0.6.2
[pip3] torch==2.0.1+1684801906.cuda120.cudnn891.nccl218.ap
[pip3] torch-mlir==1684442443
[pip3] torch-scatter==2.1.0
[pip3] torch-tb-profiler==0.4.1
[pip3] torchdata==0.7.0.dev20230601
[pip3] torchfile==0.1.0
[pip3] torchvision==0.15.1a0+42759b1
[conda] magma-cuda121 2.6.1 1 pytorch
[conda] mkl 2020.4 h726a3e6_304 conda-forge
[conda] mkl-include 2023.1.0 h84fe81f_48680 conda-forge
[conda] numpy 1.23.5 py38h7042d01_0 conda-forge
[conda] pytorch3d 0.6.2 pypi_0 pypi
[conda] torch 2.0.1+1684801906.cuda120.cudnn891.nccl218.ap pypi_0 pypi
[conda] torch-mlir 1684442443 pypi_0 pypi
[conda] torch-scatter 2.1.0 pypi_0 pypi
[conda] torch-tb-profiler 0.4.1 pypi_0 pypi
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchvision 0.15.1a0+42759b1 pypi_0 pypi