Add support for parquet files for storing the chunks
This would enable users to avoid converting their dataset if they already have it as parquet folders. We would need to run an indexing step, but that isn't too painful.
My preliminary approach:
In OptimizeDataset:
Only if the inputs are parquet files, use pyarrow's read_table function to load each parquet file one by one; the writer will then only record the number of files and the column types in the index.json file, with no chunk files created.
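A minimal sketch of what that indexing step could produce, assuming a hypothetical generate_parquet_index helper and an invented index.json layout (per-file row counts plus column types). It reads only the parquet footer metadata instead of materializing each table with read_table, but the information recorded would be the same:

```python
import json
import os

import pyarrow.parquet as pq


def generate_parquet_index(input_dir: str) -> dict:
    """Scan a folder of parquet files and record per-file row counts and column types."""
    files = sorted(f for f in os.listdir(input_dir) if f.endswith(".parquet"))
    chunks = []
    column_types = None
    for name in files:
        # ParquetFile only parses the footer metadata, so the full table is
        # never decompressed into memory at this point.
        pf = pq.ParquetFile(os.path.join(input_dir, name))
        chunks.append({"filename": name, "num_rows": pf.metadata.num_rows})
        if column_types is None:
            column_types = {field.name: str(field.type) for field in pf.schema_arrow}

    # Invented layout: LitData's real index.json schema may differ.
    index = {"chunks": chunks, "column_types": column_types}
    with open(os.path.join(input_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index
```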
In the reader:
All indices will remain as usual; only the reading at index i will change:
df.slice(7, 1).to_pandas().to_dict()  # value at row index 7 of this parquet file
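A minimal sketch of how that lookup could work end to end, assuming the invented index.json layout from the sketch above; ParquetReader and its read method are hypothetical, not LitData's actual reader API:

```python
import os

import pyarrow.parquet as pq


class ParquetReader:
    """Map a global sample index to (parquet file, local row) and read that row."""

    def __init__(self, input_dir: str, index: dict):
        self.input_dir = input_dir
        self.chunks = index["chunks"]

    def read(self, idx: int) -> dict:
        # Walk the per-file row counts stored in index.json to find which
        # parquet file holds the requested global index.
        for chunk in self.chunks:
            if idx < chunk["num_rows"]:
                table = pq.read_table(os.path.join(self.input_dir, chunk["filename"]))
                # Slice out a single row and return it as a plain dict.
                return table.slice(idx, 1).to_pylist()[0]
            idx -= chunk["num_rows"]
        raise IndexError("index out of range")
```

Using to_pylist()[0] returns one dict per row, which is a bit more direct than the to_pandas().to_dict() call above, but either works.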
If a parquet dataset has no index.json file, we can still call the helper function to generate index.json on the fly and then StreamingDataset takes control.
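A minimal sketch of that fallback, reusing the hypothetical generate_parquet_index helper from the first sketch; exactly where this check would live inside StreamingDataset is left open here:

```python
import json
import os


def load_or_generate_index(input_dir: str) -> dict:
    """Load index.json if present, otherwise build it from the parquet files."""
    index_path = os.path.join(input_dir, "index.json")
    if os.path.exists(index_path):
        with open(index_path) as f:
            return json.load(f)
    # No index.json yet: generate it on the fly (helper sketched above),
    # then the dataset can be streamed as if it had been optimized normally.
    return generate_parquet_index(input_dir)
```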
Why no multithreading or multiprocessing while creating the index.json file:
- Parquet files, once loaded in memory, are uncompressed and may exceed the memory limit.
Or we could take care of that in another PR.
What do you think @tchaton?
Yes, that's what I had in mind. The main challenge will be to make the slicing and reading as fast as possible. Might be worth using: https://github.com/pola-rs/polars
The goal is to enable reading pyarrow-backed HF datasets with LitData.
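For reference, a rough sketch of what the polars route mentioned above could look like; read_row is a placeholder name, and whether the lazy slice ends up faster than pyarrow's Table.slice would need benchmarking:

```python
import polars as pl


def read_row(parquet_path: str, row_idx: int) -> dict:
    # scan_parquet builds a lazy query; collect() returns only the requested
    # one-row slice, and the lazy engine can push the slice into the scan so
    # the whole file does not have to be decoded.
    return pl.scan_parquet(parquet_path).slice(row_idx, 1).collect().to_dicts()[0]
```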