
Timestamp precision in the DeltaTable schema of the transaction log.

Open fvaleye opened this issue 4 years ago • 6 comments

Hello!

Coming from the Delta-RS community, I have several questions regarding the timestamp type in the DeltaTable schema serialization saved in the transaction log.

Context: The transaction protocol's schema serialization format specifies the following precision for the timestamp type:

timestamp: Microsecond precision timestamp without a timezone.

This means that Spark uses microsecond-precision timestamps here, interpreted in a local or given time zone. But when Spark writes timestamp values out to non-text data sources like Parquet through Delta, the values are just instants (effectively timestamps in UTC) that carry no time zone information.

Taking that into account, if we look at the configuration "spark.sql.parquet.outputTimestampType" here, we see that the default output timestamp type is ParquetOutputTimestampType.INT96. With this default, timestamps are written to .parquet files with nanosecond precision. The configuration can also be changed to ParquetOutputTimestampType.INT64 with either TIMESTAMP_MICROS or TIMESTAMP_MILLIS.
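
For example, here is a minimal PySpark sketch (assuming a Spark session with the Delta Lake dependencies on the classpath; the output path is hypothetical) that forces the Parquet output type to microseconds so the physical type matches the protocol's declared precision:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

# Hypothetical output path; delta-spark is assumed to be available.
spark = (
    SparkSession.builder
    # Write INT64 / TIMESTAMP_MICROS instead of the default INT96 (nanoseconds),
    # so the physical Parquet type matches the protocol's microsecond timestamp.
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)

spark.range(1).withColumn("ts", current_timestamp()) \
    .write.format("delta").mode("overwrite").save("/tmp/ts_micros_table")
```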

Use-case: When I apply the transaction log schema to a DeltaTable (using the timestamp with microsecond precision here), I get a mismatch between the precision given by the protocol schema and the real one (a small reproduction sketch follows this list):

  1. The timestamp type referenced by the transaction log has microsecond precision
  2. The timestamp type written in the .parquet files has nanosecond precision because the default outputTimestampType is used (but it could be microseconds or milliseconds depending on the configuration)
  3. The schema cannot be applied to the .parquet files because I get a mismatched-precision error on a timestamp column
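
A rough illustration of this mismatch, using pyarrow to compare the physical type stored in a data file with the logical type declared in the transaction log (the file names are hypothetical):

```python
import json
import pyarrow.parquet as pq

# Hypothetical file names, for illustration only.
data_file = "my_table/part-00000.snappy.parquet"
log_file = "my_table/_delta_log/00000000000000000000.json"

# Physical schema of the data file: INT96 / INT64 nanos surfaces as timestamp[ns].
print(pq.read_schema(data_file))

# Logical schema in the transaction log: the column type is just "timestamp",
# which the protocol defines as microsecond precision.
with open(log_file) as f:
    for line in f:
        action = json.loads(line)
        if "metaData" in action:
            print(json.loads(action["metaData"]["schemaString"]))
```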

Questions

  1. Why is the timestamp precision not written alongside the timestamp type inside the schema of the transaction log? It would let us recover the DeltaTable timestamp precision when reading the DeltaTable without the Spark dependency.

  2. Does it mean that the microsecond precision of the timestamp is only for internal Spark/Delta processing? In other words, must the schema of the parquet files be read directly from the .parquet files and not from the DeltaTable transaction protocol?

  3. If we change the default timestamp precision to nanoseconds here when applying the schema to .parquet files, it will work only for the default spark.sql.parquet.outputTimestampType configuration, but not for the TIMESTAMP_MICROS and TIMESTAMP_MILLIS ones, right? (A reader-side coercion sketch follows below.)
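
On question 3, here is a hypothetical reader-side workaround (not anything delta-rs does today): instead of hard-coding one expected precision, cast whatever precision the Parquet file actually holds to the microsecond type declared by the log, so it works for INT96, TIMESTAMP_MICROS, and TIMESTAMP_MILLIS alike:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical reader-side coercion: cast every timestamp column to the
# microsecond precision declared by the transaction log, whatever the file holds.
table = pq.read_table("my_table/part-00000.snappy.parquet")

target = pa.schema([
    pa.field(f.name, pa.timestamp("us")) if pa.types.is_timestamp(f.type) else f
    for f in table.schema
])
table = table.cast(target)  # safe=False would allow truncating sub-microsecond values
```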

Thank you for your help!

fvaleye avatar Apr 07 '21 12:04 fvaleye

Did the loop on this ever get closed? I've run into this a few times when adding parquet files to delta tables because the timestamps are written with different configurations.

hntd187 avatar Oct 15 '23 16:10 hntd187

Since parquet 2.6 has a great int64 timestamp nanos type, could delta standardize on top of that? Java also has nanosecond precision.

alippai avatar Nov 02 '23 04:11 alippai

Iceberg is adding a nanosecond type too: https://github.com/apache/iceberg/pull/8683

alippai avatar May 07 '24 00:05 alippai

@alippai that's great! Unfortunately for Delta we are bound by what the delta protocol states :(

ion-elgreco avatar Aug 04 '24 22:08 ion-elgreco

@ion-elgreco how can we extend the delta protocol? I thought this is the correct issue / repo for that.

alippai avatar Aug 04 '24 22:08 alippai

> @ion-elgreco how can we extend the delta protocol? I thought this is the correct issue / repo for that.

It's the correct repo, but it needs to get accepted in the protocol first

ion-elgreco avatar Aug 04 '24 23:08 ion-elgreco