
too many files when using multiprocessing with pytorch dataloader and `decode`

Open ekorman opened this issue 2 years ago • 2 comments

I noticed that after training for a while I was getting an OSError saying there were too many open files. I think I narrowed this down to an issue with `.decode`, and the following script reproduces the behavior:

import os

import torch
import webdataset as wds

num_workers = 4
decode = False  # flip to True to reproduce the leak

if __name__ == "__main__":
    # Write a small tar of pickled samples.
    tar_path = "test_image.tar"
    writer = wds.TarWriter(tar_path)

    for i in range(100):
        writer.write({"__key__": str(i), "input.pyd": i, "output.pyd": i})

    writer.close()

    dataset = wds.WebDataset(tar_path)
    if decode:
        dataset = dataset.decode()
    dataset = dataset.to_tuple("input.pyd")

    dl = torch.utils.data.DataLoader(
        dataset, batch_size=10, num_workers=num_workers
    )
    for _ in range(10):
        # Count this process's open file descriptors before each epoch.
        os.system(
            f"lsof -p {os.getpid()} -a -d ^mem -d ^cwd -d ^rtd -d ^txt -d ^DEL -w | wc -l"
        )
        for x in dl:
            pass

Running this gives output like

 4 5 5 5 ...

i.e. the open-file count doesn't grow after each iteration through `dl`.
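The `lsof` shell-out above can also be replaced with a pure-Python counter, which avoids spawning a subprocess per measurement. A minimal sketch, assuming a Linux-style `/proc/self/fd` (or macOS's `/dev/fd`) is available; `open_fd_count` is a hypothetical helper, not webdataset API:

```python
import os

def open_fd_count() -> int:
    # Count this process's open file descriptors.
    # Linux exposes them under /proc/self/fd; macOS under /dev/fd.
    fd_dir = "/proc/self/fd"
    if not os.path.exists(fd_dir):
        fd_dir = "/dev/fd"
    return len(os.listdir(fd_dir))

baseline = open_fd_count()
f = open(os.devnull)  # opening a file should raise the count by one
grown = open_fd_count()
f.close()
```

Calling this before each epoch in the script above should show the same growth pattern that `lsof | wc -l` reports.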

However, changing `decode` to `True` gives output like

4 6 7 8 ...

so open file descriptors accumulate after every epoch. Changing `num_workers` to 0, on the other hand, makes the problem go away.
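For context on what `.decode()` has to handle here: as far as I understand webdataset's default encoders, `TarWriter` stores values under a `.pyd` extension as pickled bytes. A stdlib-only stand-in that mimics the tar layout the script creates (hypothetical helper names, not webdataset API):

```python
import io
import pickle
import tarfile

def write_samples(path: str, n: int) -> None:
    # Mimic the TarWriter layout: one "<key>.input.pyd" and one
    # "<key>.output.pyd" member per sample, each holding pickled bytes.
    with tarfile.open(path, "w") as tar:
        for i in range(n):
            for field in ("input.pyd", "output.pyd"):
                data = pickle.dumps(i)
                info = tarfile.TarInfo(name=f"{i}.{field}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

def read_inputs(path: str) -> list:
    # Unpickle only the input.pyd members, the field that
    # to_tuple("input.pyd") selects in the script above.
    values = []
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if member.name.endswith(".input.pyd"):
                values.append(pickle.loads(tar.extractfile(member).read()))
    return values
```

Reading the tar this way keeps exactly one file handle open per `tarfile.open` context, which is the behavior I'd expect from the non-`decode` path as well.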

I'm on macOS with webdataset==0.2.57 and torch==2.0.1.

ekorman avatar Sep 30 '23 19:09 ekorman