Parquet Preview: human readable `timestamp` and `date` values
Preflight Checklist
- [X] I have installed the latest version of Storage Explorer.
- [X] I have checked existing resources, including the troubleshooting guide and the release notes.
- [X] I have searched for similar issues.
Problem
When previewing Parquet files, values of columns with the data types `timestamp` and `date` are hard to read.
Example parquet: part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet.zip
PySpark code for reproducing the example parquet:
```python
import pyspark.sql.functions as F

# `spark` and `display` are provided by the notebook environment (Synapse);
# `root_path` is the target directory and is defined elsewhere in the notebook.
df = (
    spark.range(3).toDF("id")
    .withColumn("ts_base", F.to_timestamp(F.lit('2021-07-24 12:01:19.123456789012')))
    .withColumn("ts", (F.unix_timestamp(F.col("ts_base")) + F.col("id") * 10000000).cast("timestamp"))
    .withColumn("ts_str", F.col("ts").cast("string"))
    .withColumn("ts_unix", F.unix_timestamp(F.col("ts")))
    .withColumn("date", F.col("ts").cast("date"))
    .withColumn("date_str", F.col("date").cast("string"))
    .drop("ts_base")
)

df.printSchema()
display(df)

df.coalesce(1).write.format("parquet").mode("overwrite").option("overwriteSchema", "True").save(root_path + 'test1/')
```
Desired Solution
`timestamp` and `date` column values should be rendered in an easily human-readable format, e.g. `yyyy-MM-dd'T'HH:mm:ss[.SSSSSS]'Z'` for timestamps and `yyyy-MM-dd` for dates (ISO 8601).
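For illustration, a minimal Python sketch of the requested rendering (the sample values below are made up):

```python
from datetime import datetime, date, timezone

# Made-up sample values, already available as Python datetime/date objects.
ts = datetime(2021, 7, 24, 12, 1, 19, 123456, tzinfo=timezone.utc)
d = date(2021, 7, 24)

print(ts.strftime("%Y-%m-%dT%H:%M:%S.%fZ"))  # 2021-07-24T12:01:19.123456Z
print(d.isoformat())                          # 2021-07-24
```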
Alternatives and Workarounds
No response
Additional Context
No response
@keen85, slightly unrelated to your issue, would you mind sharing what is shown when you click on "Chart" in your Jupyter Notebook? :)
@MRayermannMSFT sure, but it has little meaning for this dataframe:
This is some convenience feature of Azure Synapse Analytics: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-data-visualization#displaydf-function
Sure, was just curious. :)
Same issue here on v1.31.1 (93).
@keen85 @pierrejeandev
For anyone that's interested, we have a private build that adds date formatting to preview for Parquet files. Please give it a try and leave your feedback here.
Windows: https://cralvordtest02.blob.core.windows.net/share/1.32.1-datepreview.1/StorageExplorer-windows-x64.exe?sv=2023-01-03&st=2023-12-07T18%3A38%3A14Z&se=2023-12-22T18%3A38%3A00Z&sr=b&sp=r&sig=DJPjcxkltVwrP7oSR74KGzf%2B3NxY1jOvi6%2FG783F45U%3D
For any other platforms, please let us know, and we'll provide additional links.
Hi @craxal ,
Thanks for your work.
I already had Storage Explorer installed, then downloaded your preview build and reinstalled; but I'm not sure whether this was successful.
Info dialog looks like this:
I tried previewing the parquet file that I provided originally.
The preview looks like this for me (but I'm not sure whether I actually tested your preview build):
So `timestamp` is still not readable, and `date` is localized but also includes the time "00:00". I'd prefer the ISO 8601 format.
@keen85 We can stick to ISO format unless we get additional feedback to the contrary.
Our testing shows that DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS types should be parsed correctly. In your Parquet file, however, only the date field has a DATE type; all the other fields are just UTF8 or integer types. I don't think the parsing library we use handles TIMESTAMP(...) types at the moment.
Supporting timestamp types is possible via the deprecated TIME_MILLIS, TIME_MICROS, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS converted types. Additional support for TIME and TIMESTAMP is forthcoming (LibertyDSNP/parquetjs#99).
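If you want to check which physical/logical types a Parquet file actually contains, here is a quick sketch using pyarrow (the file name below is just a placeholder):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.snappy.parquet")  # placeholder file name
print(pf.schema)                    # Parquet-level schema: INT96, INT64, UTF8, DATE, ...
print(pf.schema.to_arrow_schema())  # Arrow view: timestamp[ns], date32[day], ...
```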
Hi @craxal ,
I did some research, and I think I now understand what the problem with TIMESTAMP is...
In the past, Parquet persisted TIMESTAMP columns using the physical data type INT96. To my understanding, INT96 was used exclusively to represent TIMESTAMP columns, which is why no logical data type needed to be specified.
On the Parquet side this is now deprecated, and INT64 should be used together with the logical types TIMESTAMP(unit=MILLIS), TIMESTAMP(unit=MICROS), or TIMESTAMP(unit=NANOS).
However, despite the deprecation, very prominent data processing engines (e.g. Apache Spark, Apache Impala) still write Parquet files the old way by default, with TIMESTAMPs as INT96 (and no logical data type), because of the larger precision / range. This is also true for prominent Microsoft/Azure services that use Apache Spark (e.g. Azure Synapse, Azure HDInsight, Microsoft Fabric, Databricks).
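If you control the writer, Spark can be configured to write INT64-based timestamps instead of the legacy INT96. A minimal sketch, assuming the notebook-provided `spark` session and the `df` / `root_path` from my example above:

```python
# Ask Spark (2.3+) to write timestamp columns as INT64 with a TIMESTAMP
# logical type instead of the legacy INT96 physical type.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

# Subsequent Parquet writes then produce timestamps that readers such as
# parquetjs can map to a logical timestamp type.
df.coalesce(1).write.format("parquet").mode("overwrite").save(root_path + 'test1/')
```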
I created the example Parquet file using Azure Synapse. That is why the parsing library sees the TIMESTAMP column as INT96.
It would be greatly appreciated if Azure Storage Explorer supported this old, deprecated encoding. I saw that there is a parquetjs feature request for supporting INT96, so maybe that will solve the issue for Azure Storage Explorer indirectly?
We will keep this item open to track the progress of those features. Not much else we can do until fixes/features become available.
@craxal For 1.34, let's start with a one-day initial investigation into whether we have the skills needed to contribute back to the open-source library.
Let's break this down a bit. There currently exist three different ways to represent timestamps in the Parquet format:
| Type | Description |
|---|---|
| `INT96` | These were originally intended to store timestamps with nanosecond precision. This type is deprecated but remains for backward compatibility. The LibertyDSNP/parquetjs library currently supports this type. Storage Explorer does not format these values as timestamps, but it should be reasonably simple to do so. |
| `TIMESTAMP_MILLIS`<br/>`TIMESTAMP_MICROS` | These are what the Parquet spec refers to as ConvertedTypes, which are deprecated. The LibertyDSNP/parquetjs library currently supports these types. Storage Explorer currently formats these values as timestamps. |
| `TIMESTAMP` | This is what the Parquet spec refers to as a LogicalType. LogicalTypes differ from ConvertedTypes in that they have parameters (for example, `TIMESTAMP(unit=NANOS, isAdjustedToUTC=true)`). The LibertyDSNP/parquetjs library only appears to support ConvertedTypes. Contributing support for LogicalTypes of any kind to the library would be a considerable undertaking (understanding the Parquet encoding, learning the library codebase, adding the necessary encode/decode logic throughout, adding tests, etc.). |
This leads me to conclude that we should do the following:
- Format `INT96` values as timestamps, since this is likely still widely in use.
- Wait for support for `LogicalType`s to be added to the library by those who understand Parquet and the codebase better. If demand is high, we can push for support. Alternatively, we can tinker with it over time ourselves and eventually add support at some future time.
We started running into a build-related issue (see https://github.com/LibertyDSNP/parquetjs/issues/125).
@keen85 It turns out converting INT96 values into ISO-formatted strings is a lot more complicated than I initially thought.
Incorrect Assumptions
At first, I assumed INT96 values simply encoded the number of nanoseconds from some epoch. If that were true, the only question would be which epoch.
Upon further investigation, I found things to be much more complex. The first 8 bytes represent the time of day in nanoseconds (the bytes have to be reversed and possibly converted if the value is negative). The last 4 bytes represent a Julian day (those bytes also need to be reversed, and the Julian day needs to be converted to a Gregorian date).
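To illustrate, here is a rough sketch of decoding one such value under the common Impala/Spark convention (an illustration only, assuming non-negative values; this is not what Storage Explorer does today):

```python
from datetime import datetime, timedelta, timezone

JULIAN_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01

def int96_to_datetime(raw: bytes) -> datetime:
    """Decode a 12-byte Parquet INT96 timestamp (common Impala/Spark layout):
    bytes 0-7 = nanoseconds within the day (little-endian),
    bytes 8-11 = Julian day number (little-endian)."""
    nanos_of_day = int.from_bytes(raw[:8], "little")
    julian_day = int.from_bytes(raw[8:12], "little")
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
        days=julian_day - JULIAN_UNIX_EPOCH,
        microseconds=nanos_of_day // 1000,
    )

# Example: 2021-07-24 12:01:19 UTC (Julian day 2459420, 43279 s into the day)
raw = (43279 * 10**9).to_bytes(8, "little") + (2459420).to_bytes(4, "little")
print(int96_to_datetime(raw).isoformat())  # 2021-07-24T12:01:19+00:00
```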
This is a lot of work to convert a number to a recognizable date, work that is better left, I think, to the Parquet library.
Varying Interpretations
As if a complex encoding weren't enough, interpreting INT96 values can vary. Spark configurations can change how values are read or written. This means INT96 values might not be consistently represented. Furthermore, only Impala seems to strictly interpret INT96 values as dates/times. That is, in the broader ecosystem, INT96 values might not always be considered dates/times.
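For example (an illustration, not an exhaustive list), Spark 3.x exposes rebase-mode settings that change how INT96 values are interpreted across calendar systems:

```python
# Spark 3.x rebase modes decide whether INT96 timestamps are read/written using
# the legacy hybrid Julian/Gregorian calendar or the proleptic Gregorian calendar.
# "spark" is assumed to be an existing SparkSession.
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")
```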
As an example, in the discussion for the PR you linked to, there's mention that a reliable method of determining when an INT96 can be considered a date/time is unknown.
Additionally, it's not clear whether the values in your sample data align with the above encoding (the most significant 6 bytes are all zeroes, for starters), which suggests your data may be interpreting INT96 date/time values differently.
These consistency problems are probably why the INT96 data type was deprecated in the first place.
Conclusion
All of these points add up to a big consistency problem in how Storage Explorer should interpret INT96 values. If Storage Explorer were designed more specifically for Parquet data, we would want to consider supporting INT96 as dates/times more carefully. However, I think the best (and safest) way forward is to leave Storage Explorer as-is and encourage the Parquet library authors to add support for this.
Further reading
Hi @craxal, thanks for your comprehensive analysis. I was not fully aware of these problems when I opened the issue; sorry about that.
I understand your reasoning and suggest closing this issue.
@keen85 No apologies necessary! It was a perfectly valid request. We did our homework, and it turned out to be more work than we initially thought. All part of the process. Thank you for the feedback.