PyArrow returns empty data frame

Open fabiotakaki opened this issue 7 years ago • 1 comments

I don't know why, but a file generated by parquetjs reads fine with PySpark, while reading the same file with pyarrow returns an empty DataFrame.

In this example, Spark returns the table normally:

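from pyspark.sql import SparkSession

parquet_file = "./file.parquet"  # Should be some file on your system
spark = SparkSession.builder.appName("TestingParquet").getOrCreate()
parquetFile = spark.read.parquet(parquet_file)

# Register the DataFrame as a temporary view so it can be queried with SQL
parquetFile.createOrReplaceTempView("parquetFile")
rows = spark.sql("SELECT * FROM parquetFile")  # named `rows` to avoid shadowing the built-in `list`
rows.show()

spark.stop()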

When I try with pyarrow (through pandas):

import pandas as pd

if __name__ == "__main__":
    df = pd.read_parquet('file.parquet', engine='pyarrow')
    print(df)         # prints an empty DataFrame
    print(df.dtypes)  # the column names and dtypes are still present

It just returns an empty DataFrame containing only the column headers. Has anyone else run into this problem?

Nov 16 '18 13:11 fabiotakaki

I think you've already found the solution, but for everyone else it's worth noting that the problem is probably related to the DATA_PAGE/DATA_PAGE_V2 page format specifications (see https://github.com/ZJONSSON/parquetjs/issues/24#issuecomment-416009322), and the solution is described here: https://github.com/ZJONSSON/parquetjs#notes
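One way to check whether a file is affected is to compare the row count recorded in the Parquet footer with the number of rows pyarrow actually decodes. The following is only a minimal sketch, not an official diagnostic: the helper names `row_counts` and `describe` are made up for illustration, and it assumes that an affected file shows up as a decoded row count lower than the footer's. Newer pyarrow releases may read such files without any problem.

```python
def row_counts(path):
    """Return (rows declared in the footer, rows actually decoded) for a Parquet file."""
    # Imported here so that describe() below can be used without pyarrow installed.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    declared = pf.metadata.num_rows  # row count recorded in the file footer
    decoded = pf.read().num_rows     # row count pyarrow materializes when reading
    return declared, decoded


def describe(declared, decoded):
    """Summarize whether the reader decoded every row the footer declares."""
    if declared == decoded:
        return "ok: all %d rows decoded" % decoded
    return "mismatch: footer declares %d rows, reader decoded %d" % (declared, decoded)


# Hypothetical usage, with 'file.parquet' being the file from the report above:
# print(describe(*row_counts("file.parquet")))
```

If this reports a mismatch, the data pages are present in the file but are not being decoded by the reader, which would be consistent with the page format explanation above.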

Jun 04 '19 12:06 szdominik