Read Parquet Files Directly from S3?
🚀 The feature
The `ParquetDataFrameLoader` allows us to read Parquet files from the local file system, but I don't think it supports reading Parquet files from (for example) an S3 bucket. Make this possible.
Motivation, pitch
I would like to train my models on parquet files stored in an S3 bucket.
Alternatives
You could probably download the Parquet file locally and then use the `ParquetDataFrameLoader`?
Additional context
No response
cc: @NivekT
Does Parquet support loading a dataframe from a binary stream? If so, we might change the behavior of `ParquetDataFrameLoader` from loading by file paths to loading by binary streams. Then, users would be able to load remote Parquet files on S3 using either the AWSSDK or fsspec DataPipes.
If there is a Parquet `bytes` object, we can do:

```python
import pyarrow
import pyarrow.parquet

reader = pyarrow.BufferReader(obj)
parquet_table = pyarrow.parquet.read_table(reader)
# Then convert to a TorchArrow DataFrame or pandas
```
Some other options are `pandas.read_parquet()` (which returns a pandas DataFrame) or `pyarrow.parquet.ParquetDataset('s3://path-to-bucket/', filesystem=s3fs)`, where `s3fs` is an object that implements the S3 file system API (such as `s3fs.S3FileSystem()`).
The first option (using `pyarrow.BufferReader`) is likely the best fit based on our existing implementation. I am open to accepting a PR that modifies `ParquetDataFrameLoader` to process `bytes`.