Parquet Preview: human readable `timestamp` and `date` values
Preflight Checklist
- [X] I have installed the latest version of Storage Explorer.
- [X] I have checked existing resources, including the troubleshooting guide and the release notes.
- [X] I have searched for similar issues.
Problem
When previewing Parquet files, values of columns with the data types `timestamp` and `date` are hard to read.
Example parquet: part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet.zip
PySpark code for reproducing the example parquet:
```python
import pyspark.sql.functions as F

# `spark` and `display` are provided by the notebook environment (Synapse);
# `root_path` is the target directory and is defined elsewhere in the notebook.
df = (
    spark.range(3).toDF("id")
    .withColumn("ts_base", F.to_timestamp(F.lit('2021-07-24 12:01:19.123456789012')))
    .withColumn("ts", (F.unix_timestamp(F.col("ts_base")) + F.col("id") * 10000000).cast("timestamp"))
    .withColumn("ts_str", F.col("ts").cast("string"))
    .withColumn("ts_unix", F.unix_timestamp(F.col("ts")))
    .withColumn("date", F.col("ts").cast("date"))
    .withColumn("date_str", F.col("date").cast("string"))
    .drop("ts_base")
)

df.printSchema()
display(df)

df.coalesce(1).write.format("parquet").mode("overwrite").option("overwriteSchema", "True").save(root_path + 'test1/')
```
Desired Solution
`timestamp` and `date` column values should be rendered in an easily human-readable format, e.g. `yyyy-MM-dd'T'HH:mm:ss[.SSSSSS]'Z'` for timestamps and `yyyy-MM-dd` for dates (ISO 8601).
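For illustration, a minimal Python sketch of the requested rendering (the sample values below are made up):

```python
from datetime import datetime, date, timezone

# Made-up sample values, already available as Python datetime/date objects.
ts = datetime(2021, 7, 24, 12, 1, 19, 123456, tzinfo=timezone.utc)
d = date(2021, 7, 24)

print(ts.strftime("%Y-%m-%dT%H:%M:%S.%fZ"))  # 2021-07-24T12:01:19.123456Z
print(d.isoformat())                          # 2021-07-24
```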
Alternatives and Workarounds
No response
Additional Context
No response
@keen85, slightly unrelated to your issue, would you mind sharing what is shown when you click on "Chart" in your Jupyter Notebook? :)
@MRayermannMSFT sure, but it has little meaning for this dataframe:
This is some convenience feature of Azure Synapse Analytics: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-data-visualization#displaydf-function
Sure, was just curious. :)
Same issue here on v1.31.1 (93).
@keen85 @pierrejeandev
For anyone that's interested, we have a private build that adds date formatting to preview for Parquet files. Please give it a try and leave your feedback here.
Windows: https://cralvordtest02.blob.core.windows.net/share/1.32.1-datepreview.1/StorageExplorer-windows-x64.exe?sv=2023-01-03&st=2023-12-07T18%3A38%3A14Z&se=2023-12-22T18%3A38%3A00Z&sr=b&sp=r&sig=DJPjcxkltVwrP7oSR74KGzf%2B3NxY1jOvi6%2FG783F45U%3D
For any other platforms, please let us know, and we'll provide additional links.
Hi @craxal ,
Thanks for your work.
I already had Storage Explorer installed, then downloaded your preview build and reinstalled; but I'm not sure whether this was successful.
Info dialog looks like this:
I tried previewing the parquet file that I provided originally.
The preview looks like this for me (but I'm not sure whether I actually tested your preview build):
So `timestamp` is still not readable, and `date` is localized but also includes the time "00:00". I'd prefer the ISO 8601 format.
@keen85 We can stick to ISO format unless we get additional feedback to the contrary.
Our testing shows that DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS types should be parsed correctly. In your Parquet file, however, only the date field has a DATE type; all the other fields are just UTF8 or integer types. I don't think the parsing library we use handles TIMESTAMP(...) types at the moment.
Supporting timestamp types is possible via the deprecated TIME_MILLIS, TIME_MICROS, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS converted types. Additional support for TIME and TIMESTAMP is forthcoming (LibertyDSNP/parquetjs#99).
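If you want to check which physical/logical types a Parquet file actually contains, here is a quick sketch using pyarrow (the file name below is just a placeholder):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.snappy.parquet")  # placeholder file name
print(pf.schema)                    # Parquet-level schema: INT96, INT64, UTF8, DATE, ...
print(pf.schema.to_arrow_schema())  # Arrow view: timestamp[ns], date32[day], ...
```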
Hi @craxal ,
I did some research, and I think I now understand what the problem with TIMESTAMP is...
In the past, Parquet persisted TIMESTAMP columns using the physical data type INT96. To my understanding, INT96 was used exclusively to represent TIMESTAMP columns, which is why no logical data type needed to be specified.
On the Parquet side this is now deprecated, and INT64 should be used together with the logical types TIMESTAMP(unit=MILLIS), TIMESTAMP(unit=MICROS), or TIMESTAMP(unit=NANOS).
However, despite the deprecation, very prominent data processing engines (e.g. Apache Spark, Apache Impala) still write Parquet files the old way by default, with TIMESTAMPs as INT96 (and no logical data type), because of the larger precision / range. This is also true for prominent Microsoft/Azure services that use Apache Spark (e.g. Azure Synapse, Azure HDInsight, Microsoft Fabric, Databricks).
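If you control the writer, Spark can be configured to write INT64-based timestamps instead of the legacy INT96. A minimal sketch, assuming the notebook-provided `spark` session and the `df` / `root_path` from my example above:

```python
# Ask Spark (2.3+) to write timestamp columns as INT64 with a TIMESTAMP
# logical type instead of the legacy INT96 physical type.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

# Subsequent Parquet writes then produce timestamps that readers such as
# parquetjs can map to a logical timestamp type.
df.coalesce(1).write.format("parquet").mode("overwrite").save(root_path + 'test1/')
```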
I created the example Parquet file using Azure Synapse. That is why the parsing library sees the TIMESTAMP column as INT96.
It would be greatly appreciated if Azure Storage Explorer supported this old, deprecated encoding. I saw that there is a parquetjs feature request for supporting INT96, so maybe that will solve the issue for Azure Storage Explorer indirectly?
We will keep this item open to track the progress of those features. Not much else we can do until fixes/features become available.
@craxal For 1.34, let's start with a one-day initial investigation into whether we have the skills needed to contribute back to the open-source library.
Let's break this down a bit. There currently exist three different ways to represent timestamps in the Parquet format:
| Type | Description |
|---|---|
| `INT96` | These were originally intended to store timestamps with nanosecond precision. This type is deprecated but remains for backward compatibility. The LibertyDSNP/parquetjs library currently supports this type. Storage Explorer does not format these values as timestamps, but it should be reasonably simple to do so. |
| `TIMESTAMP_MILLIS`<br/>`TIMESTAMP_MICROS` | These are what the Parquet spec refers to as ConvertedTypes, which are deprecated. The LibertyDSNP/parquetjs library currently supports these types. Storage Explorer currently formats these values as timestamps. |
| `TIMESTAMP` | This is what the Parquet spec refers to as a LogicalType. LogicalTypes differ from ConvertedTypes in that they have parameters (for example, `TIMESTAMP(unit=NANOS, isAdjustedToUTC=true)`). The LibertyDSNP/parquetjs library only appears to support ConvertedTypes. Contributing support for LogicalTypes of any kind to the library would be a considerable undertaking (understanding the Parquet encoding, learning the library codebase, adding the necessary encode/decode logic throughout, adding tests, etc.). |
This leads me to conclude that we should do the following:
- Format `INT96` values as timestamps, since this is likely still widely in use.
- Wait for support for `LogicalType`s to be added to the library by those who understand Parquet and the codebase better. If demand is high, we can push for support. Alternatively, we can tinker with it over time ourselves and eventually add support at some future time.
We started running into a build-related issue (see https://github.com/LibertyDSNP/parquetjs/issues/125).
@keen85 It turns out converting INT96 values into ISO-formatted strings is a lot more complicated than I initially thought.
Incorrect Assumptions
At first, I assumed INT96 values simply encoded the number of nanoseconds from some epoch. If that were true, the only question would be which epoch.
Upon further investigation, I found things to be much more complex. The first 8 bytes represent the time of day in nanoseconds (the bytes have to be reversed and possibly converted if the value is negative). The last 4 bytes represent a Julian day (those bytes also need to be reversed, and the Julian day needs to be converted to a Gregorian date).
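To illustrate, here is a rough sketch of decoding one such value under the common Impala/Spark convention (an illustration only, assuming non-negative values; this is not what Storage Explorer does today):

```python
from datetime import datetime, timedelta, timezone

JULIAN_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01

def int96_to_datetime(raw: bytes) -> datetime:
    """Decode a 12-byte Parquet INT96 timestamp (common Impala/Spark layout):
    bytes 0-7 = nanoseconds within the day (little-endian),
    bytes 8-11 = Julian day number (little-endian)."""
    nanos_of_day = int.from_bytes(raw[:8], "little")
    julian_day = int.from_bytes(raw[8:12], "little")
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
        days=julian_day - JULIAN_UNIX_EPOCH,
        microseconds=nanos_of_day // 1000,
    )

# Example: 2021-07-24 12:01:19 UTC (Julian day 2459420, 43279 s into the day)
raw = (43279 * 10**9).to_bytes(8, "little") + (2459420).to_bytes(4, "little")
print(int96_to_datetime(raw).isoformat())  # 2021-07-24T12:01:19+00:00
```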
This is a lot of work to convert a number to a recognizable date, work that is better left, I think, to the Parquet library.
Varying Interpretations
As if a complex encoding weren't enough, interpreting INT96 values can vary. Spark configurations can change how values are read or written. This means INT96 values might not be consistently represented. Furthermore, only Impala seems to strictly interpret INT96 values as dates/times. That is, in the broader ecosystem, INT96 values might not always be considered dates/times.
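For example (an illustration, not an exhaustive list), Spark 3.x exposes rebase-mode settings that change how INT96 values are interpreted across calendar systems:

```python
# Spark 3.x rebase modes decide whether INT96 timestamps are read/written using
# the legacy hybrid Julian/Gregorian calendar or the proleptic Gregorian calendar.
# "spark" is assumed to be an existing SparkSession.
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")
```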
As an example, in the discussion for the PR you linked to, there's mention that a reliable method of determining when an INT96 can be considered a date/time is unknown.
Additionally, it's not clear whether the values in your sample data align with the above encoding (the most significant 6 bytes are all zeroes, for starters), which suggests your data may be interpreting INT96 date/time values differently.
These consistency problems are probably why the INT96 data type was deprecated in the first place.
Conclusion
All of these points add up to a big consistency problem in how Storage Explorer should interpret INT96 values. If Storage Explorer were designed more specifically for Parquet data, we would want to consider supporting INT96 as dates/times more carefully. However, I think the best (and safest) way forward is to leave Storage Explorer as-is and encourage the Parquet library authors to add support for this.
Further reading
Hi @craxal, thanks for your comprehensive analysis. I was not fully aware of these problems when I opened the issue; sorry about that.
I understand your reasoning and suggest closing this issue.
@keen85 No apologies necessary! It was a perfectly valid request. We did our homework, and it turned out to be more work than we initially thought. All part of the process. Thank you for the feedback.