Add support for parquet files for storing the chunks
This would enable users to avoid converting their dataset if they already have it as parquet folders. We would need to run an indexing step, but that isn't too painful.
My preliminary approach:
In OptimizeDataset:
Only if the inputs are parquet files, use pyarrow's read_table function to load each parquet file one by one; the writer will then only record the number of files and the column types in the index.json file, with no chunk files created.
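A minimal sketch of what that indexing step could produce, assuming a hypothetical generate_parquet_index helper and an invented index.json layout (per-file row counts plus column types). It reads only the parquet footer metadata instead of materializing each table with read_table, but the information recorded would be the same:

```python
import json
import os

import pyarrow.parquet as pq


def generate_parquet_index(input_dir: str) -> dict:
    """Scan a folder of parquet files and record per-file row counts and column types."""
    files = sorted(f for f in os.listdir(input_dir) if f.endswith(".parquet"))
    chunks = []
    column_types = None
    for name in files:
        # ParquetFile only parses the footer metadata, so the full table is
        # never decompressed into memory at this point.
        pf = pq.ParquetFile(os.path.join(input_dir, name))
        chunks.append({"filename": name, "num_rows": pf.metadata.num_rows})
        if column_types is None:
            column_types = {field.name: str(field.type) for field in pf.schema_arrow}

    # Invented layout: LitData's real index.json schema may differ.
    index = {"chunks": chunks, "column_types": column_types}
    with open(os.path.join(input_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index
```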
In the reader:
All indices will remain as usual; only the reading at index i will change:
df.slice(7, 1).to_pandas().to_dict()  # value at row index 7 of this parquet file
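A minimal sketch of how that lookup could work end to end, assuming the invented index.json layout from the sketch above; ParquetReader and its read method are hypothetical, not LitData's actual reader API:

```python
import os

import pyarrow.parquet as pq


class ParquetReader:
    """Map a global sample index to (parquet file, local row) and read that row."""

    def __init__(self, input_dir: str, index: dict):
        self.input_dir = input_dir
        self.chunks = index["chunks"]

    def read(self, idx: int) -> dict:
        # Walk the per-file row counts stored in index.json to find which
        # parquet file holds the requested global index.
        for chunk in self.chunks:
            if idx < chunk["num_rows"]:
                table = pq.read_table(os.path.join(self.input_dir, chunk["filename"]))
                # Slice out a single row and return it as a plain dict.
                return table.slice(idx, 1).to_pylist()[0]
            idx -= chunk["num_rows"]
        raise IndexError("index out of range")
```

Using to_pylist()[0] returns one dict per row, which is a bit more direct than the to_pandas().to_dict() call above, but either works.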
If a parquet dataset has no index.json file, we can still call the helper function to generate index.json on the fly and then StreamingDataset takes control.
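A minimal sketch of that fallback, reusing the hypothetical generate_parquet_index helper from the first sketch; exactly where this check would live inside StreamingDataset is left open here:

```python
import json
import os


def load_or_generate_index(input_dir: str) -> dict:
    """Load index.json if present, otherwise build it from the parquet files."""
    index_path = os.path.join(input_dir, "index.json")
    if os.path.exists(index_path):
        with open(index_path) as f:
            return json.load(f)
    # No index.json yet: generate it on the fly (helper sketched above),
    # then the dataset can be streamed as if it had been optimized normally.
    return generate_parquet_index(input_dir)
```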
Why no multithreading or multiprocessing while creating the index.json file:
- Parquet files, once loaded in memory, are uncompressed and may exceed the memory limit.
Or we could take care of that in another PR.
What do you think @tchaton?
Yes, that's what I had in mind. The main challenge will be to make the slicing and reading as fast as possible. Might be worth using: https://github.com/pola-rs/polars
The goal is to enable reading pyarrow-backed HF datasets with LitData.
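For reference, a rough sketch of what the polars route mentioned above could look like; read_row is a placeholder name, and whether the lazy slice ends up faster than pyarrow's Table.slice would need benchmarking:

```python
import polars as pl


def read_row(parquet_path: str, row_idx: int) -> dict:
    # scan_parquet builds a lazy query; collect() returns only the requested
    # one-row slice, and the lazy engine can push the slice into the scan so
    # the whole file does not have to be decoded.
    return pl.scan_parquet(parquet_path).slice(row_idx, 1).collect().to_dicts()[0]
```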