
Read Parquet Files Directly from S3?

Open vedantroy opened this issue 1 year ago • 2 comments

🚀 The feature

The ParquetDataFrameLoader allows us to read parquet files from the local file system, but I don't think it supports reading parquet files from (for example) an S3 bucket.

Make this possible.

Motivation, pitch

I would like to train my models on parquet files stored in an S3 bucket.

Alternatives

You could probably download the parquet file locally and then use the ParquetDataFrameLoader?

Additional context

No response

vedantroy avatar Jul 30 '22 06:07 vedantroy

cc: @NivekT Does Parquet support loading a DataFrame from a binary stream? If so, we might change the behavior of ParquetDataFrameLoader from loading by file paths to loading by binary streams. Then, users would be able to load remote Parquet files on S3 using either the AWSSDK or fsspec DataPipes.

ejguan avatar Aug 01 '22 14:08 ejguan

If there is a Parquet bytes object, we can do:

```python
import pyarrow as pa
import pyarrow.parquet as pq

reader = pa.BufferReader(obj)
parquet_table = pq.read_table(reader)
# Then convert to a TorchArrow DataFrame or pandas
```

Some other options are pandas.read_parquet() (which returns a pandas DataFrame), or pyarrow.parquet.ParquetDataset('s3://path-to-bucket/', filesystem=s3fs), where s3fs is an object that implements the S3 file system API (such as s3fs.S3FileSystem()).

The first option (using pyarrow.BufferReader) is likely the best based on our existing implementation.

I am open to accepting a PR that modifies ParquetDataFrameLoader to process bytes.

NivekT avatar Aug 01 '22 15:08 NivekT