datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

[EPIC] [DISCUSS] Comet timezone handling

Open andygrove opened this issue 1 month ago • 3 comments

What is the problem the feature request solves?

Comet currently assumes that all native processing uses the UTC timezone. When reading from Parquet sources, Comet converts timestamps to UTC.

      String timeZoneId = conf.get("spark.sql.session.timeZone");
      // Native code uses "UTC" always as the timeZoneId when converting from spark to arrow schema.
      Schema arrowSchema = Utils$.MODULE$.toArrowSchema(sparkSchema, "UTC");
      byte[] serializedRequestedArrowSchema = serializeArrowSchema(arrowSchema);
      Schema dataArrowSchema = Utils$.MODULE$.toArrowSchema(dataSchema, "UTC");
      byte[] serializedDataArrowSchema = serializeArrowSchema(dataArrowSchema);

However, we are now seeing that this causes correctness issues or exceptions when the data source is not Parquet:

  • https://github.com/apache/datafusion-comet/issues/2720
  • https://github.com/apache/datafusion-comet/issues/2649

This epic is for reviewing and discussing Comet's approach to time zones.

Describe the potential solution

No response

Additional context

No response

andygrove avatar Nov 07 '25 21:11 andygrove

when the data source is not Parquet:

when is this true?

parthchandra avatar Nov 07 '25 22:11 parthchandra

when the data source is not Parquet:

when is this true?

For Comet sinks, such as:

  • LocalTableScanExec
  • CometSparkToColumnarExec
  • Exchanges
  • UnionExec
  • CoalesceExec

andygrove avatar Nov 11 '25 15:11 andygrove

Another related timezone issue:

Cast from string to timestamp:

      case DataTypes.TimestampType if timeZoneId.exists(tz => tz != "UTC") =>
        Incompatible(Some(s"Cast will use UTC instead of $timeZoneId"))

andygrove avatar Nov 11 '25 15:11 andygrove