Huge temp files while uploading data using MDSWriter
Environment
- OS: Windows 11
To reproduce
Steps to reproduce the behavior:
- Uploading and converting a local WebDataset with MDSWriter, as in the code below, produces huge temp files: the WebDataset is 800 GB, and the temp files stored in AppData/Local/Temp grow to 1.8 TB, eventually crashing the upload.
Code
import webdataset as wds
from streaming import MDSWriter
from tqdm import tqdm

# Local WebDataset shards; the brace range expands to 5000 tar files.
file = r"file:d:/Datasets/shards50m/{00000..04999}.tar"
dataset = wds.WebDataset(file).decode("pil").to_tuple("jpg", "txt")

# Remote S3 output directory.
data_dir = "s3://50m/mds/"
columns = {
    'image': 'pil',
    'caption': 'str',
}

with MDSWriter(out=data_dir, columns=columns, progress_bar=True) as out:
    try:
        for sample in tqdm(dataset):
            try:
                # Each sample should be a (jpg, txt) pair.
                if len(sample) != 2:
                    print("Skipping sample, missing 'txt' or 'jpg'.")
                    continue
                img, caption = sample
                sample = {
                    'image': img,
                    'caption': caption,
                }
                out.write(sample)
            except Exception as e:
                print(f"Error processing sample: {e}")
    except Exception as e:
        print(f"Error processing sample: {e}")
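
To help confirm where the space goes while the script above runs, here is a small diagnostic sketch using only the Python standard library. It assumes the writer stages its shards in Python's default temp directory (on Windows this usually resolves to AppData/Local/Temp via the TEMP/TMP environment variables); that assumption is not verified here. The helper names `dir_size_bytes` and `temp_report` are hypothetical and not part of any library.

```python
import os
import shutil
import tempfile


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Files can vanish or be locked while a writer is active.
                pass
    return total


def temp_report(path=None):
    """Summarize directory size and free disk space for path.

    Defaults to Python's temp directory, which is an assumption about
    where the staged shards end up.
    """
    path = path or tempfile.gettempdir()
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "dir_gb": dir_size_bytes(path) / 1e9,
        "disk_free_gb": usage.free / 1e9,
    }


if __name__ == "__main__":
    print(temp_report())
```

Running this periodically in a second terminal during the upload should show whether the temp directory keeps growing instead of being cleaned up as shards are uploaded.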