PyTorch Distributed Training: File descriptors opened and never closed
When using distributed training w/ PyTorch v2.3.0, RETURNN opens file descriptors on every subepoch that are never closed.
Comparing the output of lsof -l for the same process a few subepochs apart, one gets diffs with blocks of FDs like the following repeating:
pt_main_t 13339 17144239 100r FIFO 0,13 0t0 2311492 pipe
pt_main_t 13339 17144239 101w FIFO 0,13 0t0 2311492 pipe
pt_main_t 13339 17144239 102u a_inode 0,14 0 13849 [eventpoll]
pt_main_t 13339 17144239 103u IPv4 2311493 0t0 TCP g-02.apptek.local:33695 (LISTEN)
pt_main_t 13339 17144239 104u IPv4 2311497 0t0 TCP g-02.apptek.local:33695->localhost:34500 (ESTABLISHED)
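A simpler way to watch this from inside the process (my own sketch, not part of the original debugging; it just counts entries in /proc/self/fd, so it is Linux-only) is to log something like this once per subepoch:

import os

def count_open_fds() -> int:
    """Count the file descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))

# Log this e.g. at the start of every subepoch; with the leak present, the
# number grows by a handful of FDs (pipes, eventpoll, TCP sockets) each time.
print("open FDs:", count_open_fds())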
Using pipe FD 101 from the example above and some quick-and-dirty tracing code (see below) that I wrapped around various functions in the PyTorch engine, I was able to narrow down the pipe FD's construction to:
-> Engine.train_epoch
-> DataLoader.__iter__
-> DataLoader._get_iterator
-> _SingleProcessDataLoaderIter.__init__
-> _BaseDataLoaderIter.__init__
-> torch.distributed.new_group(backend="gloo")
-> torch.distributed._new_group_with_tag
-> torch.distributed._new_process_group_helper
-> backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
-> native code
The only reason this process group exists is to distribute a random seed among the worker group members. Note that in this example it makes no difference whether we use num_workers > 0 or not, since both _SingleProcessDataLoaderIter and _MultiProcessingDataLoaderIter inherit from _BaseDataLoaderIter. What I'm wondering is why it is not destroyed and its FDs closed once the data loader iterator goes out of scope.
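To see the effect in isolation, here is a minimal single-process sketch (my own, not from the original report; it assumes Linux and a free port 29500) that creates gloo groups in a loop the same way the data loader iterator does, so one can watch the FD count grow. Explicitly destroying the groups should release the FDs again:

import os
import torch.distributed as dist

def count_open_fds() -> int:
    return len(os.listdir("/proc/self/fd"))

# Single-process "distributed" setup, just enough for gloo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

print("baseline FDs:", count_open_fds())
groups = []
for i in range(5):
    # This is the same call _BaseDataLoaderIter.__init__ makes.
    groups.append(dist.new_group(backend="gloo"))
    print(f"after new_group #{i}: {count_open_fds()} FDs")

# Destroying the groups explicitly should close the associated pipes/sockets.
for pg in groups:
    dist.destroy_process_group(pg)
print("after destroy:", count_open_fds())

dist.destroy_process_group()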
Torch allows reusing workers via the "persistent_workers" flag; in that case the workers are not recreated every subepoch but just reset, which means Torch won't create new process groups and the FDs won't accumulate. This can serve as a workaround for now.
An lsof diff with num_workers=1, persistent_workers=True shows no such growth (the green blocks of added lines visible in the other diff are missing).
EDIT: I found out we already default to setting persistent_workers=True when num_workers is set. So with torch_dataloader_opts = {"num_workers": 1} we are good.
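In config terms the workaround is just the following (sketch; it relies on RETURNN passing these options through to the DataLoader and, as noted above, defaulting persistent_workers to True once num_workers > 0):

# RETURNN config snippet (workaround): one worker process is enough.
# With num_workers > 0, RETURNN already defaults persistent_workers to True,
# so the DataLoader iterator (and its gloo process group) is created only once.
torch_dataloader_opts = {"num_workers": 1}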
Tracing code:
from .basic import is_valid_fd
from returnn.log import log

num_fd = 101


def fd(name):
    # Decorator: log whether FD `num_fd` is open before and after the wrapped
    # call (and, if the call returns an iterator, again once it is exhausted).
    def decorator(func):
        name_ = name or func.__name__

        def unfold(val):
            for v in val:
                yield v
            print(f"after value {name_}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)

        def wrapped(*args, **kwargs):
            print(f"before {name_}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)
            try:
                val = func(*args, **kwargs)
                return unfold(val) if hasattr(val, "__next__") else val
            finally:
                print(f"after {name_}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)

        return wrapped

    return decorator


from contextlib import contextmanager


@contextmanager
def fd_c(name):
    # Context-manager variant: log the FD state around an arbitrary code region.
    try:
        print(f"before {name}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)
        yield
    finally:
        print(f"after {name}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)
(For future debugging reference: I think using sys.settrace would have made it easier to find the place than such a context manager and hacking around in the code. With settrace you would not need to modify any code, and once the FD opens, you could simply use the traceback module to print the stack trace.)
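A rough sketch of that settrace idea (my own, untested; it probes the FD with os.fstat instead of RETURNN's is_valid_fd, and it only traces the current thread):

import os
import sys
import traceback

WATCH_FD = 101  # the FD number we want to catch being opened

def _fd_is_open(fd: int) -> bool:
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False

def _tracer(frame, event, arg):
    # Called by the interpreter on call/line events. As soon as the watched
    # FD becomes valid, dump the current stack and stop tracing.
    if _fd_is_open(WATCH_FD):
        traceback.print_stack(frame)
        sys.settrace(None)
        return None
    return _tracer

sys.settrace(_tracer)  # install before the code under suspicion runs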
This sounds like a problem that other PyTorch users would also have run into, no? Did you search for corresponding PyTorch issues?
Or, if you don't find any, can you report this upstream?
I don't exactly understand your comment regarding persistent_workers.
In the case of num_workers > 0, we already set persistent_workers=True, so there was no problem.
In the case of num_workers == 0, you say that we should also use persistent_workers=True? But that does not make sense. In DataLoader.__init__:
if persistent_workers and num_workers == 0:
    raise ValueError('persistent_workers option needs num_workers > 0')
So what you actually say is that using num_workers > 0 is the workaround for now? persistent_workers does not need to be set.
But also, looking at _BaseDataLoaderIter.__init__, the logic is completely independent of persistent_workers and also independent of num_workers? It seems this dist group is just always created?
if isinstance(self._dataset, IterDataPipe):
    if dist.is_available() and dist.is_initialized():
        self._pg = dist.new_group(backend="gloo")
Edit: Ah, wrong. In DataLoader:
def __iter__(self) -> '_BaseDataLoaderIter':
    # When using a single worker the returned iterator should be
    # created everytime to avoid resetting its state
    # However, in the case of a multiple workers iterator
    # the iterator is only created once in the lifetime of the
    # DataLoader object so that workers can be reused
    if self.persistent_workers and self.num_workers > 0:
        if self._iterator is None:
            self._iterator = self._get_iterator()
        else:
            self._iterator._reset(self)
        return self._iterator
    else:
        return self._get_iterator()
So it would reuse the iterator object and not recreate it.
So yes, using torch_dataloader_opts = {"num_workers": 1} is currently a good workaround (and what I would always recommend using anyway).
So what you actually say is that using num_workers > 0 is the workaround for now? persistent_workers does not need to be set.
Yes, exactly, because we have that default in RETURNN. If it weren't for that, we'd be seeing the issue with or without num_workers > 0, and it would only go away with num_workers > 0, persistent_workers=True.
(In my initial tests I started w/ num_workers: 0 to get the simplest distributed setup, and that is where I saw the bug. When I then experimented w/ the code in __iter__ and set num_workers=1, persistent_workers=True, I did not see the issue anymore, and only then discovered the default in RETURNN.)
So it would reuse the iterator object and not recreate it.
Correct.
I filed https://github.com/pytorch/pytorch/issues/129868
Another workaround on the RETURNN side would be to cache the iterator and not always recreate it. But this might also be tricky to get right. I think we should first wait for the conclusion of https://github.com/pytorch/pytorch/issues/129868.
Apparently this is a won't fix on the torch side. Our workaround is the officially recommended solution.
Ok, so num_workers > 0 it is, then. So maybe we should also set num_workers = 1 when the user does not specify otherwise?
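A sketch of what such a default could look like on the RETURNN side (hypothetical; the exact place in the engine where torch_dataloader_opts is read and turned into DataLoader kwargs is an assumption here):

# Hypothetical: where the engine builds the DataLoader kwargs from the config.
dataloader_opts = dict(config.typed_value("torch_dataloader_opts") or {})
# Default to one worker process so that persistent_workers kicks in and the
# gloo process group of the loader iterator is created only once.
dataloader_opts.setdefault("num_workers", 1)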