
Memory leak in caching

Open tadejsv opened this issue 3 years ago • 5 comments

I have a dataset of about 55 shards, each a ~500 MB .tgz file, cached on disk (the originals are stored on S3). Each sample contains an npy array and a cls item. When I run this simple code

import webdataset as wds
from tqdm import tqdm

urls = [
    f"pipe:aws s3 cp s3://my_bucket/shard_{i:03d}.tgz -"
    for i in range(55)
]
dataset = wds.WebDataset(urls, cache_dir="./cache").decode()

for sample in tqdm(dataset):
    pass

I notice that memory usage steadily increases, from 1 GB at the beginning to about 18 GB at the end. The increases happen in steps, around the time a new shard is read.

What is the reason for this? I am not even using a shuffle buffer, so I don't think anything should stay in memory.
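For anyone wanting to reproduce the step pattern without the S3 setup, here is a minimal, hypothetical sketch of the failure mode using only the standard library: a cache that keeps a reference to every shard's bytes grows in steps, while one that drops each shard stays flat. (The real issue was observed as RSS growth; tracemalloc only sees Python-level allocations, but the shape is the same. The shard size and helper names below are made up for illustration.)

```python
import tracemalloc

def read_shard(i, leaky_cache):
    # Simulate reading a ~1 MB shard from the network.
    data = bytes(1_000_000)
    if leaky_cache is not None:
        leaky_cache.append(data)  # the "leak": shard bytes are retained forever
    return len(data)

def retained_after(n_shards, leak):
    # Return bytes still allocated after consuming n_shards shards.
    cache = [] if leak else None
    tracemalloc.start()
    for i in range(n_shards):
        read_shard(i, cache)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

leaky = retained_after(10, leak=True)
clean = retained_after(10, leak=False)
print(leaky > 5 * clean)  # the leaky run retains roughly 10 MB; the clean run frees each shard
```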

tadejsv avatar Dec 13 '21 07:12 tadejsv

More context: the leak is most likely due to caching. If I remove the cache_dir argument, RAM usage stays low and constant.

tadejsv avatar Dec 13 '21 11:12 tadejsv

@tmbdev pinging just in case you missed this

tadejsv avatar Dec 21 '21 08:12 tadejsv

Yes, this is a known bug in v1. The caching has been rewritten in v2 and shouldn't leak anymore. I suggest you check out and use the v2 branch. I'll make that the main branch in the next few days.
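For readers curious about the fix, the general technique is to stream each shard to a local file in fixed-size chunks while yielding those chunks to the consumer, so memory stays bounded regardless of shard size; renaming the file only after the stream completes also prevents half-downloaded shards from being mistaken for valid cache entries. The sketch below illustrates that idea and is not the actual webdataset implementation; all names in it are hypothetical.

```python
import io
import os
import tempfile

def cached_stream(source, cache_path, chunk_size=64 * 1024):
    """Yield chunks from `source`, writing them to `cache_path` as we go.

    Memory use is bounded by chunk_size, not the shard size, and the
    final rename marks the cache entry as complete only when the whole
    stream has been written.
    """
    tmp = cache_path + ".tmp"
    with open(tmp, "wb") as f:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
            yield chunk
    os.rename(tmp, cache_path)  # atomic rename: partial downloads never get the real name

# Usage: consume a fake 1 MB "shard" with bounded memory.
src = io.BytesIO(bytes(1_000_000))
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "shard_000.tgz")
    total = sum(len(c) for c in cached_stream(src, path))
    print(total, os.path.getsize(path))  # 1000000 1000000
```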

tmbdev avatar Jan 11 '22 21:01 tmbdev

Awesome, happy to hear that.

tadejsv avatar Jan 11 '22 21:01 tadejsv

@tmbdev Is there any timeline on v2 support? I am also interested in streaming from S3 and caching the dataset locally, but I have been running into issues on the main branch, particularly with some shards not being downloaded completely.

abhi-mosaic avatar Jan 27 '22 02:01 abhi-mosaic

The caching code has been rewritten and this should not be an issue anymore.

tmbdev avatar Mar 05 '23 21:03 tmbdev