MLDatasets.jl

Write datasets in JLD2 or Arrow format for faster reads

Open CarloLucibello opened this issue 2 years ago • 6 comments

We could have a "processed" folder in each dataset folder where we write the dataset object the first time we create it. On subsequent constructions, e.g. d = MNIST(), we just load the JLD2 file instead of re-parsing the raw data.

Example:

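using FileIO  # JLD2.jl must also be installed so FileIO can handle .jld2 files

function MNIST(...)
    dataset_dir = ...
    processed_file = joinpath(dataset_dir, "processed", "dataset.jld2")
    # Fast path: reuse the cached dataset object if it was already written.
    if isfile(processed_file)
        return FileIO.load(processed_file, "dataset")
    end

    # Slow path: build the dataset from the raw files, then cache it.
    mnist = ...
    mkpath(dirname(processed_file))  # the "processed" folder may not exist yet
    FileIO.save(processed_file, Dict("dataset" => mnist))
    return mnist
end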

CarloLucibello avatar May 06 '22 06:05 CarloLucibello

I have done this for large vision datasets like COCO, whose annotations are in JSON and can be slow to parse. One thing to keep in mind is the size of the JLD2 files, though of course it shouldn't be a problem for MNIST. Arrow.jl can also be a good format, with built-in compression, when the samples are made up of primitive types and arrays.

lorenzoh avatar May 06 '22 07:05 lorenzoh

What should we expect from the JLD2 file sizes? Hopefully not larger than the original data, right?

CarloLucibello avatar May 06 '22 07:05 CarloLucibello

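Depends. If you have a large dataset of .jpg images and store them as arrays (hence losslessly), the size can be several times that of the original files, because the .jpg files are compressed and the decoded arrays are not.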
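
To put rough numbers on that, here is a back-of-the-envelope sketch. The image dimensions and the 100 kB .jpg size are made-up illustrative values, not measurements, and raw_bytes is a hypothetical helper rather than anything from MLDatasets.jl:

```julia
# Bytes needed to store one decoded image as a dense array.
raw_bytes(height, width, channels; bytes_per_value = 1) =
    height * width * channels * bytes_per_value

# Hypothetical example: a 640x480 RGB photo whose .jpg file is about 100 kB.
jpg_bytes = 100_000                                        # assumed file size, not a measurement
as_uint8 = raw_bytes(480, 640, 3)                          # 921_600 bytes
as_float32 = raw_bytes(480, 640, 3; bytes_per_value = 4)   # 3_686_400 bytes

println("UInt8 array:   ", round(as_uint8 / jpg_bytes; digits = 1), "x the .jpg")
println("Float32 array: ", round(as_float32 / jpg_bytes; digits = 1), "x the .jpg")
```

Compression in the container format may claw some of this back, but lossless compression of decoded photos generally won't get close to the .jpg size.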

lorenzoh avatar May 06 '22 08:05 lorenzoh

I agree that Arrow.jl is a good format:

  1. Built-in compression
  2. Cross-language dataset processing
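
A minimal sketch of both points, assuming Arrow.jl is installed. The toy table, its column names, and the temporary path are invented for illustration and say nothing about how MLDatasets.jl would actually lay out its data:

```julia
using Arrow

# Toy dataset (hypothetical): each sample is a small Float32 vector plus an integer label.
# A NamedTuple of equal-length columns is a valid Tables.jl source.
table = (features = [Float32[i, 2i, 3i] for i in 1:4], targets = [0, 1, 0, 1])

path = joinpath(mktempdir(), "dataset.arrow")
Arrow.write(path, table; compress = :zstd)   # :lz4 is the other built-in codec

loaded = Arrow.Table(path)                   # read the table back
println(collect(loaded.targets))
```

Since the file is written in the standard Arrow format, it should in principle be readable from other languages' Arrow libraries as well, though that is not tested here.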

zsz00 avatar May 07 '22 03:05 zsz00

HuggingFace's datasets library also uses Arrow: https://huggingface.co/docs/datasets/about_arrow

CarloLucibello avatar May 14 '22 07:05 CarloLucibello

Some code showing how to read/write color arrays from/to Arrow tables: https://gist.github.com/CarloLucibello/51d713ec4a1612b46e6c90e53c0f88e8

CarloLucibello avatar Feb 11 '23 22:02 CarloLucibello