parquet-rs
Writing Dates and Timestamps
I'm continuing my adventures in writing CSV to Parquet, but I'm stuck on how to write times/dates to Parquet.
Specifically, how do I declare the schema (assuming I'm using the text format message schema {})?
I read up on the logical types and their mapping to/from data types, so I tried using i64 for my schema, but I think I'm missing something because I don't know how to map the type to a TIMESTAMP.
I also searched Google for the format of the schema, but had no luck (for timestamps). Is there some place that documents this?
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
I would use TIMESTAMP_MILLIS for now. It is just INT64 with the corresponding logical type, and is probably the easiest to write.
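To make that mapping concrete: per the LogicalTypes document linked above, TIMESTAMP_MILLIS annotates an INT64 holding milliseconds since the Unix epoch, and DATE annotates an INT32 holding days since the epoch. Here is a rough Python sketch of the conversion a CSV-to-Parquet writer would have to do before handing values to the column writer. The function names, the input formats, and the assumption that the CSV times are already in UTC are mine for illustration; none of this is parquet-rs API.

```python
from datetime import date, datetime, timezone

# Reference points for the two logical types (Unix epoch).
EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_DATE = date(1970, 1, 1)


def to_timestamp_millis(text: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> int:
    """Parse a CSV timestamp (assumed UTC) into the INT64 stored by TIMESTAMP_MILLIS."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    delta = parsed - EPOCH_DATETIME
    # Integer arithmetic avoids the float rounding of parsed.timestamp() * 1000.
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def to_date_days(text: str, fmt: str = "%Y-%m-%d") -> int:
    """Parse a CSV date into the INT32 stored by DATE (days since the epoch)."""
    return (datetime.strptime(text, fmt).date() - EPOCH_DATE).days


if __name__ == "__main__":
    print(to_timestamp_millis("2019-01-01 00:00:00"))
    print(to_date_days("2019-01-01"))
```

The same arithmetic carries over to Rust: whatever date library you parse with, the value written to the INT64 column is just this integer.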
Thanks @sadikovi, I was confused by the UTC stuff on the timestamp logical type.
Writing a timestamp now works with message schema {REQUIRED INT64 MyField (TIMESTAMP_MILLIS)}, but I'm unable to read the Parquet file back in Pandas or PySpark.
PySpark:
spark.read.parquet("file1.parquet").printSchema()
# this correctly prints the schema shown below, but .show() throws an error
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Indicator: boolean (nullable = true)
|-- Timestamp: timestamp (nullable = true)
# trying to show records
Py4JJavaError: An error occurred while calling o62.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 16, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: Dictionary encoding not supported for type: BOOLEAN
Pandas:
pd.read_parquet("file1.parquet")
ArrowIOError: Not yet implemented: Dictionary encoding is not implemented for boolean values.