too many files when using multiprocessing with pytorch dataloader and `decode`
I noticed that after training for a while I was getting an OSError saying there were too many open files. I think I narrowed this down to an issue with `.decode`, and the following script shows the behavior:
```python
import os

import torch
import webdataset as wds

num_workers = 4
decode = False

if __name__ == "__main__":
    # Write a small tar with 100 trivial samples.
    tar_path = "test_image.tar"
    writer = wds.TarWriter(tar_path)
    for i in range(100):
        writer.write({"__key__": str(i), "input.pyd": i, "output.pyd": i})
    writer.close()

    dataset = wds.WebDataset(tar_path)
    if decode:
        dataset = dataset.decode()
    dataset = dataset.to_tuple("input.pyd")

    dl = torch.utils.data.DataLoader(
        dataset, batch_size=10, num_workers=num_workers
    )

    for _ in range(10):
        # Count this process's open file descriptors before each epoch.
        os.system(
            f"lsof -p {os.getpid()} -a -d ^mem -d ^cwd -d ^rtd -d ^txt -d ^DEL -w | wc -l"
        )
        for x in dl:
            pass
```
Running this gives output like
```
4
5
5
5
...
```
i.e. the number of open files doesn't grow after each iteration through `dl`.
However, changing `decode` to `True` gives output like
```
4
6
7
8
...
```
so new file descriptors are being left open after every epoch. However, setting `num_workers` to 0 does not have this problem.
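As an aside, the descriptor count can also be taken in-process rather than shelling out to lsof; a minimal sketch, assuming psutil is installed (it is not used in the script above):

```python
import psutil

# Number of file descriptors currently open in this process (POSIX only).
print(psutil.Process().num_fds())
```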
I'm on macOS with webdataset==0.2.57 and torch==2.0.1.
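In the meantime, raising the soft file-descriptor limit at least delays the OSError. A sketch using the standard resource module; the 4096 target is arbitrary, and this is a stopgap, not a fix for the leak:

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# macOS may reject RLIM_INFINITY for RLIMIT_NOFILE, so pick a finite target.
target = 4096 if hard == resource.RLIM_INFINITY else min(hard, 4096)
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))
```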