parquetjs
PyArrow returns empty data frame
I don't know why, but a file generated by parquetjs reads fine with PySpark, while reading the same file with pyarrow returns an empty DataFrame.
In this example, Spark returns the table normally:
from pyspark.sql import SparkSession

parquet_file = "./file.parquet"  # Should be some file on your system
spark = SparkSession.builder.appName("TestingParquet").getOrCreate()
parquetFile = spark.read.parquet(parquet_file)
parquetFile.createOrReplaceTempView("parquetFile")
# Named `result` rather than `list` to avoid shadowing the Python builtin
result = spark.sql("SELECT * FROM parquetFile")
result.show()
spark.stop()
When I try with pyarrow:
import pandas as pd

if __name__ == "__main__":
    df = pd.read_parquet('file.parquet', engine='pyarrow')
    print(df)
    print(df.dtypes)
It just returns an empty DataFrame containing only the column headers. Has anyone else run into this problem?
I think you've already found the solution, but for everyone else it is worth noting that the problem is probably related to the DATA_PAGE/DATA_PAGE_V2 specifications (see https://github.com/ZJONSSON/parquetjs/issues/24#issuecomment-416009322), and the solution is described here: https://github.com/ZJONSSON/parquetjs#notes
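For anyone trying to confirm they are hitting the same thing, here is a rough diagnostic sketch, not a definitive test. It compares the row count recorded in the Parquet footer with the number of rows pyarrow actually decodes. The helper names `rows_lost` and `check_parquet_file` are made up for illustration. It also assumes the footer written by parquetjs carries the correct row count, and whether the symptom shows up exactly this way may depend on your pyarrow version.

```python
def rows_lost(footer_rows, decoded_rows):
    """Return True when the footer promises rows that the reader did not decode."""
    return footer_rows > 0 and decoded_rows < footer_rows


def check_parquet_file(path):
    """Compare the footer row count with the rows pyarrow actually reads.

    Hypothetical helper: returns (footer_rows, decoded_rows, lost).
    """
    # Imported lazily so the pure helper above works without pyarrow installed.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    footer_rows = parquet_file.metadata.num_rows   # row count stored in the file footer
    decoded_rows = parquet_file.read().num_rows    # rows actually decoded into a Table
    return footer_rows, decoded_rows, rows_lost(footer_rows, decoded_rows)


# Example (path is an assumption, use your own file):
# print(check_parquet_file("file.parquet"))
```

If the last value comes back `True`, the file most likely has data pages that this reader does not decode, and regenerating it with the workaround from the notes linked above should make the two counts match.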