parquet format support

Open weibingo opened this issue 6 years ago • 3 comments

Hi @mtth, is there any plan to support parquet data format?

Parquet files carry their own schema, so they can be read directly into pandas, and written back the same way. Python Parquet modules: fastparquet, pyarrow.

weibingo avatar Nov 17 '18 09:11 weibingo

Hi @weibingo, no plan currently but this would be a welcome PR. In the meantime, you would have to manually de/serialize the output of the raw read and write methods.

mtth avatar Nov 20 '18 09:11 mtth

@mtth - do you have sample code for this approach?

wilberh avatar Aug 07 '20 16:08 wilberh
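
For the write direction, the manual de/serialization mentioned above could look roughly like the sketch below. It assumes `client` is an hdfs client such as `hdfs.InsecureClient` and `dataframe` is a pandas DataFrame; `write_parquet` is a hypothetical helper name, not part of the hdfs library. The DataFrame is serialized into an in-memory buffer and the resulting bytes are handed to the raw `write` method.

```python
from io import BytesIO


def write_parquet(client, dataframe, hdfs_path, overwrite=True):
    """Serialize a DataFrame to Parquet in memory, then upload the bytes.

    Hypothetical helper (not part of hdfs). Assumes `client` exposes
    `write(hdfs_path, data=..., overwrite=...)` and `dataframe` exposes
    `to_parquet(file_like)`, as pandas DataFrames do.
    """
    buffer = BytesIO()
    # Requires a Parquet engine (pyarrow or fastparquet) to be installed.
    # Assumption: the engine leaves the buffer open after writing.
    dataframe.to_parquet(buffer)
    payload = buffer.getvalue()
    client.write(hdfs_path, data=payload, overwrite=overwrite)
    return len(payload)  # number of bytes uploaded
```

Usage would be something like `write_parquet(hdfs_client, dataframe, hdfs_path_file)`. The whole file is held in memory before upload, which is fine for moderately sized frames but may not suit very large ones.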

To read a Parquet file from HDFS into a pandas DataFrame, I currently read the whole file into a BytesIO buffer first and then pass that buffer to pandas.

from io import BytesIO
import pandas as pd

with hdfs_client.read(hdfs_path_file) as hdfs_reader:
    buffer = BytesIO(hdfs_reader.read())  # loads the entire file into memory
    dataframe = pd.read_parquet(buffer)

If I instead pass hdfs_reader to pandas directly, like this:

with hdfs_client.read(hdfs_path_file) as hdfs_reader:
    dataframe = pd.read_parquet(hdfs_reader)

I get the following error:

Traceback (most recent call last):
  File "...", line 940, in pandas_from_parquet
    dataframe = pd.read_parquet(hdfs_reader)
  File ".../lib/python3.6/site-packages/pandas/io/parquet.py", line 288, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File ".../lib/python3.6/site-packages/pandas/io/parquet.py", line 131, in read
    **kwargs).to_pandas()
  File ".../lib/python3.6/site-packages/pyarrow/parquet.py", line 1076, in read_table
    pf = ParquetFile(source, metadata=metadata)
  File ".../lib/python3.6/site-packages/pyarrow/parquet.py", line 102, in __init__
    self.reader.open(source, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 639, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: seek

Is there a way to read the parquet file into Pandas directly without reading it completely to a BytesIO object first?

ghost avatar Nov 30 '21 08:11 ghost
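
The `seek` error above arises because a Parquet reader has to jump to the footer at the end of the file, while the stream returned by `read` can only be consumed front to back. One possible workaround, sketched below, is a small seekable wrapper that fetches byte ranges on demand. `HdfsRangeReader` is a hypothetical class, not part of the hdfs library; it assumes that `client.status(path)["length"]` returns the file size and that `client.read(path, offset=..., length=...)` yields a reader for exactly that byte range.

```python
import io


class HdfsRangeReader(io.RawIOBase):
    """Seekable, read-only view of an HDFS file backed by ranged reads.

    Hypothetical helper: relies on `client.status(path)["length"]` for the
    file size and `client.read(path, offset=..., length=...)` for each range.
    """

    def __init__(self, client, hdfs_path):
        super().__init__()
        self._client = client
        self._path = hdfs_path
        self._size = client.status(hdfs_path)["length"]
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        # Never request bytes past the end of the file.
        length = min(len(buffer), self._size - self._pos)
        if length <= 0:
            return 0  # end of file
        # Each call issues one ranged read against the client.
        with self._client.read(self._path, offset=self._pos, length=length) as reader:
            data = reader.read()
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)
```

It could then be tried as `pd.read_parquet(io.BufferedReader(HdfsRangeReader(hdfs_client, hdfs_path_file)))`. Every read becomes a separate request, so this is not necessarily faster than the BytesIO approach; it mainly helps when only some columns or row groups are needed and the file is large.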