webdataset
Memory leak in caching
I have a dataset of about 55 shards, each a ~500MB .tgz file; the shards are cached on disk (the originals are stored on S3). Each sample contains an npy array and a cls item. When I run this simple code
import webdataset as wds
from tqdm import tqdm

urls = [
    f"pipe:aws s3 cp s3://my_bucket/shard_{i:03d}.tgz -"
    for i in range(55)
]
dataset = wds.WebDataset(urls, cache_dir="./cache").decode()
for sample in tqdm(dataset):
    pass
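One way to confirm the step-wise growth is to log resident memory as the loop runs. A minimal sketch using only the standard library; the iterable here is a stand-in for the WebDataset pipeline above, and the helper name is illustrative:

```python
import resource
import sys


def iterate_with_memory_log(samples, log_every=10000):
    """Iterate over samples, recording peak RSS periodically.

    `samples` stands in for the WebDataset pipeline above;
    any iterable works.
    """
    # ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    peaks = []
    for i, _ in enumerate(samples):
        if i % log_every == 0:
            rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * scale
            peaks.append(rss)
            print(f"sample {i}: peak RSS {rss / 2**20:.1f} MiB")
    return peaks


# Example with a dummy iterable; substitute the dataset above.
peaks = iterate_with_memory_log(range(30000), log_every=10000)
```

If the leak is in the cache path, the logged peaks should jump roughly once per shard rather than grow smoothly per sample.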
I notice that memory usage increases steadily, from 1GB at the beginning to about 18GB at the end. The increases happen in steps, around the time a new shard is read.
What is the reason for this? I am not even using a shuffle buffer, so I don't think anything should stay in memory.
More context: the memory leak is most likely due to caching; if I remove the cache_dir argument, RAM usage stays low and constant.
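Until the rewritten caching is available, one possible workaround is to do the caching step yourself: download each shard to local disk once and point WebDataset at the local files, sidestepping the leaking cache path entirely. A rough sketch, with the bucket and shard names taken from the snippet above and the helper name being my own:

```python
import os
import subprocess


def ensure_local_shards(bucket, n_shards, cache_dir="./cache"):
    """Download shards with `aws s3 cp` if not already present,
    then return local paths for WebDataset to read directly.

    Bucket and shard naming follow the example above and are
    illustrative; adjust to your layout.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local = []
    for i in range(n_shards):
        name = f"shard_{i:03d}.tgz"
        path = os.path.join(cache_dir, name)
        if not os.path.exists(path):
            subprocess.run(
                ["aws", "s3", "cp", f"s3://{bucket}/{name}", path],
                check=True,
            )
        local.append(path)
    return local


# urls = ensure_local_shards("my_bucket", 55)
# dataset = wds.WebDataset(urls).decode()
```

Reading local files directly avoids both the pipe and the internal cache, at the cost of managing the downloads yourself.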
@tmbdev pinging just in case you missed this
Yes, this is a known bug in v1. The caching has been rewritten in v2 and shouldn't leak anymore. I suggest you check out and use the v2 branch. I'll make that the main branch in the next few days.
Awesome, happy to hear that.
@tmbdev Is there any timeline on v2 support? I am also interested in streaming from S3 and caching the dataset locally, but I have been running into issues on the main branch, particularly with some shards not being downloaded completely.
The caching code has been rewritten and this should not be an issue anymore.