hdfs
parquet format support
Hi @mtth, is there any plan to support parquet data format?
Parquet data carries its own schema, so it can be read directly into pandas, and writing works the same way. Python Parquet modules: fastparquet, pyarrow.
Hi @weibingo, no plan currently but this would be a welcome PR. In the meantime, you would have to manually de/serialize the output of the raw read and write methods.
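A minimal sketch of what manual de/serialization around the raw read and write methods could look like. The helper names `read_parquet_from_hdfs` and `write_parquet_to_hdfs` and the `loader` parameter are hypothetical and not part of the hdfs library. The sketch assumes `client` behaves like the library's client (`client.read(path)` is a context manager yielding a file-like reader, `client.write(path, data=..., overwrite=...)` accepts bytes) and that the installed pandas Parquet engine can write to a file-like object.

```python
from io import BytesIO


def read_parquet_from_hdfs(client, hdfs_path, loader=None):
    """Read a whole Parquet file from HDFS into memory, then parse it.

    `loader` is a hypothetical hook: any callable that takes a seekable
    file-like object; it defaults to pandas.read_parquet.
    """
    if loader is None:
        import pandas as pd  # imported lazily; only needed for the default loader
        loader = pd.read_parquet
    with client.read(hdfs_path) as reader:
        # Buffer the raw bytes so the Parquet reader gets a seekable object.
        buffer = BytesIO(reader.read())
    return loader(buffer)


def write_parquet_to_hdfs(client, hdfs_path, dataframe, overwrite=False):
    """Serialize `dataframe` to Parquet in memory, then upload the bytes."""
    buffer = BytesIO()
    # Assumption: the Parquet engine in use accepts a file-like target.
    dataframe.to_parquet(buffer)
    client.write(hdfs_path, data=buffer.getvalue(), overwrite=overwrite)
```

Both helpers hold the entire file in memory, so they are best suited to files that comfortably fit in RAM.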
@mtth - do you have sample code for this approach?
To read a pandas DataFrame stored in Parquet format on HDFS, I currently read the whole file into a BytesIO buffer first and then pass that buffer to pandas:
with hdfs_client.read(hdfs_path_file) as hdfs_reader:
    buffer = BytesIO(hdfs_reader.read())
dataframe = pd.read_parquet(buffer)
If I try to pass the hdfs_reader to pandas directly, like
with hdfs_client.read(hdfs_path_file) as hdfs_reader:
    dataframe = pd.read_parquet(hdfs_reader)
I get the following error:
Traceback (most recent call last):
  File "...", line 940, in pandas_from_parquet
    dataframe = pd.read_parquet(hdfs_reader)
  File ".../lib/python3.6/site-packages/pandas/io/parquet.py", line 288, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File ".../lib/python3.6/site-packages/pandas/io/parquet.py", line 131, in read
    **kwargs).to_pandas()
  File ".../lib/python3.6/site-packages/pyarrow/parquet.py", line 1076, in read_table
    pf = ParquetFile(source, metadata=metadata)
  File ".../lib/python3.6/site-packages/pyarrow/parquet.py", line 102, in __init__
    self.reader.open(source, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 639, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: seek
Is there a way to read the parquet file into Pandas directly without reading it completely to a BytesIO object first?
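The `IOError: seek` is consistent with the reader being a forward-only stream: Parquet keeps its metadata footer at the end of the file, so a Parquet reader needs `seek()` before it can read anything else. The sketch below does not avoid the copy, but it spills the stream to a temporary file in chunks instead of holding it all in a BytesIO object. The helper name `as_seekable` and the `chunk_size` parameter are hypothetical, and the sketch assumes the object yielded by `hdfs_client.read(...)` either lacks `seekable()` or reports `False`.

```python
import shutil
import tempfile


def as_seekable(reader, chunk_size=1 << 20):
    """Return `reader` if it supports seek(), else a seekable temp-file copy."""
    seekable = getattr(reader, "seekable", None)
    if callable(seekable) and seekable():
        return reader
    # Copy the forward-only stream to disk chunk by chunk.
    spill = tempfile.TemporaryFile()
    shutil.copyfileobj(reader, spill, chunk_size)
    spill.seek(0)  # rewind so the consumer starts at the beginning
    return spill


# Possible usage, with the names from the snippets above:
# with hdfs_client.read(hdfs_path_file) as hdfs_reader:
#     with as_seekable(hdfs_reader) as seekable_file:
#         dataframe = pd.read_parquet(seekable_file)
```

This trades memory for local disk I/O; the temporary file is removed when it is closed.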