huge temp files while uploading data using MDS writer

Open MaxxP0 opened this issue 7 months ago • 2 comments

Environment

  • OS: Windows 11

To reproduce

Steps to reproduce the behavior:

  1. Convert a local WebDataset and upload it using MDSWriter, as in the code below. This produces huge temp files: the WebDataset is 800 GB, and the temp file stored in AppData/Local/Temp grows to 1.8 TB, eventually crashing the upload.

Code

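import webdataset as wds
from streaming import MDSWriter
from tqdm import tqdm

# Local WebDataset shards (brace notation expands to 5000 tar files)
file = r"file:d:/Datasets/shards50m/{00000..04999}.tar"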

dataset = wds.WebDataset(file).decode("pil").to_tuple("jpg", "txt")

data_dir = "s3://50m/mds/"

columns = {
    'image': 'pil',
    'caption': 'str'
}


with MDSWriter(out=data_dir, columns=columns, progress_bar=True) as out:
    try:
        for sample in tqdm(dataset):
            try:
                if len(sample) != 2:
                    print("Skipping sample, missing 'txt' or 'jpg'.")
                    continue

                img, caption = sample
                
                sample = {
                    'image': img,
                    'caption': caption,
                }
                out.write(sample)
            except Exception as e:
                print(f"Error processing sample: {e}")

    except Exception as e:
        # Raised while iterating the dataset, not while writing a single sample
        print(f"Error iterating dataset: {e}")
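To confirm that the space is going into the temp directory while the conversion runs, the directory's total size can be measured from a second process. The following is a minimal sketch using only the standard library. The helper name `dir_size_bytes` is hypothetical and not part of the `streaming` package, and it assumes the writer stages its files under the location returned by `tempfile.gettempdir()` (typically AppData/Local/Temp on a default Windows setup).

```python
import os
import tempfile


def dir_size_bytes(root: str) -> int:
    """Return the total size in bytes of all regular files under root.

    Hypothetical helper for illustration; not part of any library.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks so linked files are not counted twice.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # Files can disappear while the writer is running; ignore them.
                continue
    return total


if __name__ == "__main__":
    # Assumption: the temp files land under the default temp directory.
    temp_root = tempfile.gettempdir()
    print(f"{temp_root}: {dir_size_bytes(temp_root) / 1e9:.2f} GB")
```

Running this periodically during the upload shows whether the temp directory keeps growing or whether files are removed once their shards have been uploaded.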

MaxxP0 Jul 24 '24 17:07