webdataset
Memory leak in caching
I have a dataset of about 55 shards, each a ~500MB .tgz file; the shards are cached on disk (the originals are stored on S3). Each sample contains an npy array and a cls item. When I run this simple code
import webdataset as wds
from tqdm import tqdm

urls = [
    f"pipe:aws s3 cp s3://my_bucket/shard_{i:03d}.tgz -"
    for i in range(55)
]
dataset = wds.WebDataset(urls, cache_dir="./cache").decode()
for sample in tqdm(dataset):
    pass
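One way to confirm the step-wise growth is to log resident memory as the loop runs. A minimal sketch using only the standard library; the iterable here is a stand-in for the WebDataset pipeline above, and the helper name is illustrative:

```python
import resource
import sys


def iterate_with_memory_log(samples, log_every=10000):
    """Iterate over samples, recording peak RSS periodically.

    `samples` stands in for the WebDataset pipeline above;
    any iterable works.
    """
    # ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    peaks = []
    for i, _ in enumerate(samples):
        if i % log_every == 0:
            rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * scale
            peaks.append(rss)
            print(f"sample {i}: peak RSS {rss / 2**20:.1f} MiB")
    return peaks


# Example with a dummy iterable; substitute the dataset above.
peaks = iterate_with_memory_log(range(30000), log_every=10000)
```

If the leak is in the cache path, the logged peaks should jump roughly once per shard rather than grow smoothly per sample.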
I notice that memory usage increases steadily, from 1GB at the beginning to about 18GB at the end. The increases happen in steps, around the time a new shard is read.
What is the reason for this? I am not even using a shuffle buffer, so I don't think anything should stay in memory.
More context: the memory leak is most likely due to caching; if I remove the cache_dir argument, RAM usage stays low and constant.
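Until the rewritten caching is available, one possible workaround is to do the caching step yourself: download each shard to local disk once and point WebDataset at the local files, sidestepping the leaking cache path entirely. A rough sketch, with the bucket and shard names taken from the snippet above and the helper name being my own:

```python
import os
import subprocess


def ensure_local_shards(bucket, n_shards, cache_dir="./cache"):
    """Download shards with `aws s3 cp` if not already present,
    then return local paths for WebDataset to read directly.

    Bucket and shard naming follow the example above and are
    illustrative; adjust to your layout.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local = []
    for i in range(n_shards):
        name = f"shard_{i:03d}.tgz"
        path = os.path.join(cache_dir, name)
        if not os.path.exists(path):
            subprocess.run(
                ["aws", "s3", "cp", f"s3://{bucket}/{name}", path],
                check=True,
            )
        local.append(path)
    return local


# urls = ensure_local_shards("my_bucket", 55)
# dataset = wds.WebDataset(urls).decode()
```

Reading local files directly avoids both the pipe and the internal cache, at the cost of managing the downloads yourself.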
@tmbdev pinging just in case you missed this
Yes, this is a known bug in v1. The caching has been rewritten in v2 and shouldn't leak anymore. I suggest you check out and use the v2 branch. I'll make that the main branch in the next few days.
Awesome, happy to hear that.
@tmbdev Is there any timeline on v2 support? I am also interested in streaming from S3 and caching the dataset locally, but I have been running into issues on the main branch, particularly with some shards not being downloaded completely.
The caching code has been rewritten and this should not be an issue anymore.