PyTorch Distributed Training: File descriptors opened and never closed
When using distributed training w/ PyTorch v2.3.0, RETURNN opens file descriptors on every subepoch that are never closed.
Comparing the output of lsof -l for the same process a few subepochs apart, one gets diffs with blocks of FDs like the following repeating:
pt_main_t 13339 17144239 100r FIFO 0,13 0t0 2311492 pipe
pt_main_t 13339 17144239 101w FIFO 0,13 0t0 2311492 pipe
pt_main_t 13339 17144239 102u a_inode 0,14 0 13849 [eventpoll]
pt_main_t 13339 17144239 103u IPv4 2311493 0t0 TCP g-02.apptek.local:33695 (LISTEN)
pt_main_t 13339 17144239 104u IPv4 2311497 0t0 TCP g-02.apptek.local:33695->localhost:34500 (ESTABLISHED)
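A simpler way to watch this from inside the process (my own sketch, not part of the original debugging; it just counts entries in /proc/self/fd, so it is Linux-only) is to log something like this once per subepoch:

import os

def count_open_fds() -> int:
    """Count the file descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))

# Log this e.g. at the start of every subepoch; with the leak present, the
# number grows by a handful of FDs (pipes, eventpoll, TCP sockets) each time.
print("open FDs:", count_open_fds())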
Using pipe FD 101 from the example above and some quick-and-dirty tracing code (see below) that I wrapped around various functions in the PyTorch engine, I was able to narrow down the pipe FD's construction to:
-> Engine.train_epoch
-> DataLoader.__iter__
-> DataLoader._get_iterator
-> _SingleProcessDataLoaderIter.__init__
-> _BaseDataLoaderIter.__init__
-> torch.distributed.new_group(backend="gloo")
-> torch.distributed._new_group_with_tag
-> torch.distributed._new_process_group_helper
-> backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
-> native code
The only reason this process group exists is to distribute a random seed among the worker group members. Note that in this example it makes no difference whether we use num_workers > 0 or not, since both _SingleProcessDataLoaderIter and _MultiProcessingDataLoaderIter inherit from _BaseDataLoaderIter. What I'm wondering is why it is not destroyed and its FDs closed once the data loader iterator goes out of scope.
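To see the effect in isolation, here is a minimal single-process sketch (my own, not from the original report; it assumes Linux and a free port 29500) that creates gloo groups in a loop the same way the data loader iterator does, so one can watch the FD count grow. Explicitly destroying the groups should release the FDs again:

import os
import torch.distributed as dist

def count_open_fds() -> int:
    return len(os.listdir("/proc/self/fd"))

# Single-process "distributed" setup, just enough for gloo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

print("baseline FDs:", count_open_fds())
groups = []
for i in range(5):
    # This is the same call _BaseDataLoaderIter.__init__ makes.
    groups.append(dist.new_group(backend="gloo"))
    print(f"after new_group #{i}: {count_open_fds()} FDs")

# Destroying the groups explicitly should close the associated pipes/sockets.
for pg in groups:
    dist.destroy_process_group(pg)
print("after destroy:", count_open_fds())

dist.destroy_process_group()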
Torch allows reusing workers via the "persistent_workers" flag; in that case the workers are not recreated every subepoch but just reset, which means Torch won't create new process groups and the FDs won't accumulate. This can serve as a workaround for now.
An lsof diff with num_workers=1, persistent_workers=True shows no such growth (the green blocks of added lines visible in the other diff are missing).
EDIT: I found out we already default to setting persistent_workers=True when num_workers is set. So with torch_dataloader_opts = {"num_workers": 1} we are good.
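In config terms the workaround is just the following (sketch; it relies on RETURNN passing these options through to the DataLoader and, as noted above, defaulting persistent_workers to True once num_workers > 0):

# RETURNN config snippet (workaround): one worker process is enough.
# With num_workers > 0, RETURNN already defaults persistent_workers to True,
# so the DataLoader iterator (and its gloo process group) is created only once.
torch_dataloader_opts = {"num_workers": 1}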
Tracing code:
from .basic import is_valid_fd
from returnn.log import log

num_fd = 101


def fd(name):
    # Decorator: log whether FD `num_fd` is open before and after the wrapped
    # call (and, if the call returns an iterator, again once it is exhausted).
    def decorator(func):
        name_ = name or func.__name__

        def unfold(val):
            for v in val:
                yield v
            print(f"after value {name_}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)

        def wrapped(*args, **kwargs):
            print(f"before {name_}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)
            try:
                val = func(*args, **kwargs)
                return unfold(val) if hasattr(val, "__next__") else val
            finally:
                print(f"after {name_}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)

        return wrapped

    return decorator


from contextlib import contextmanager


@contextmanager
def fd_c(name):
    # Context-manager variant: log the FD state around an arbitrary code region.
    try:
        print(f"before {name}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)
        yield
    finally:
        print(f"after {name}, FD {num_fd} is {is_valid_fd(num_fd)}", file=log.v3)
(For future debugging reference: I think using sys.settrace would have made it easier to find the place than such a context manager and hacking around in the code. With settrace you would not need to modify any code, and once the FD opens, you could simply use the traceback module to print the stack trace.)
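A rough sketch of that settrace idea (my own, untested; it probes the FD with os.fstat instead of RETURNN's is_valid_fd, and it only traces the current thread):

import os
import sys
import traceback

WATCH_FD = 101  # the FD number we want to catch being opened

def _fd_is_open(fd: int) -> bool:
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False

def _tracer(frame, event, arg):
    # Called by the interpreter on call/line events. As soon as the watched
    # FD becomes valid, dump the current stack and stop tracing.
    if _fd_is_open(WATCH_FD):
        traceback.print_stack(frame)
        sys.settrace(None)
        return None
    return _tracer

sys.settrace(_tracer)  # install before the code under suspicion runs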
This sounds like a problem that other PyTorch users would also have run into, no? Did you search for corresponding PyTorch issues?
Or, if you don't find any, can you report this upstream?
I don't exactly understand your comment regarding persistent_workers.
In the case of num_workers > 0, we already set persistent_workers=True, so there was no problem.
In the case of num_workers == 0, you say that we should also use persistent_workers=True? But that does not make sense. In DataLoader.__init__:
if persistent_workers and num_workers == 0:
    raise ValueError('persistent_workers option needs num_workers > 0')
So what you actually say is that using num_workers > 0 is the workaround for now? persistent_workers does not need to be set.
But also, looking at _BaseDataLoaderIter.__init__, the logic is completely independent of persistent_workers and also independent of num_workers? It seems this dist group is just always created?
if isinstance(self._dataset, IterDataPipe):
    if dist.is_available() and dist.is_initialized():
        self._pg = dist.new_group(backend="gloo")
Edit: Ah, wrong. In DataLoader:
def __iter__(self) -> '_BaseDataLoaderIter':
    # When using a single worker the returned iterator should be
    # created everytime to avoid resetting its state
    # However, in the case of a multiple workers iterator
    # the iterator is only created once in the lifetime of the
    # DataLoader object so that workers can be reused
    if self.persistent_workers and self.num_workers > 0:
        if self._iterator is None:
            self._iterator = self._get_iterator()
        else:
            self._iterator._reset(self)
        return self._iterator
    else:
        return self._get_iterator()
So it would reuse the iterator object and not recreate it.
So yes, using torch_dataloader_opts = {"num_workers": 1} is currently a good workaround (and what I would always recommend using anyway).
So what you actually say is that using num_workers > 0 is the workaround for now? persistent_workers does not need to be set.
Yes, exactly, because we have that default in RETURNN. If it weren't for that, we'd be seeing the issue with or without num_workers > 0, and it would only go away with num_workers > 0, persistent_workers=True.
(In my initial tests I started w/ num_workers: 0 to get the simplest distributed setup, and that is where I saw the bug. When I then experimented w/ the code in __iter__ and set num_workers=1, persistent_workers=True, I did not see the issue anymore, and only then discovered the default in RETURNN.)
So it would reuse the iterator object and not recreate it.
Correct.
I filed https://github.com/pytorch/pytorch/issues/129868
Another workaround on the RETURNN side would be to cache the iterator and not always recreate it. But this might also be tricky to get right. I think we should first wait for the conclusion of https://github.com/pytorch/pytorch/issues/129868.
Apparently this is a won't fix on the torch side. Our workaround is the officially recommended solution.
Ok, so num_workers > 0 it is, then. So maybe we should also set num_workers = 1 when the user does not specify otherwise?
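A sketch of what such a default could look like on the RETURNN side (hypothetical; the exact place in the engine where torch_dataloader_opts is read and turned into DataLoader kwargs is an assumption here):

# Hypothetical: where the engine builds the DataLoader kwargs from the config.
dataloader_opts = dict(config.typed_value("torch_dataloader_opts") or {})
# Default to one worker process so that persistent_workers kicks in and the
# gloo process group of the loader iterator is created only once.
dataloader_opts.setdefault("num_workers", 1)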