[EPIC] [DISCUSS] Comet timezone handling
What is the problem the feature request solves?
Comet currently assumes that all native processing uses the UTC timezone. When reading from Parquet sources, Comet converts timestamps to UTC. For example, the Arrow schema handed to native code is always built with a hardcoded "UTC" timezone, even though the session timezone is read just above:
```java
String timeZoneId = conf.get("spark.sql.session.timeZone");
// Native code always uses "UTC" as the timeZoneId when converting from the Spark schema to the Arrow schema.
Schema arrowSchema = Utils$.MODULE$.toArrowSchema(sparkSchema, "UTC");
byte[] serializedRequestedArrowSchema = serializeArrowSchema(arrowSchema);
Schema dataArrowSchema = Utils$.MODULE$.toArrowSchema(dataSchema, "UTC");
byte[] serializedDataArrowSchema = serializeArrowSchema(dataArrowSchema);
```
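One possible direction, sketched here purely for discussion, would be to pass the session timezone that is already read above through to `toArrowSchema` instead of the literal "UTC". The `Utils` import path below is an assumption, and whether the native kernels can actually honor a non-UTC zone is exactly what this epic needs to settle:

```scala
import org.apache.arrow.vector.types.pojo.Schema
import org.apache.spark.sql.comet.util.Utils // assumed location of the Utils helper shown above
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

object ArrowSchemaSketch {
  // Sketch only: derive the Arrow field timezone from the Spark session instead
  // of hardcoding "UTC" when building the schema passed to native code.
  def requestedArrowSchema(sparkSchema: StructType): Schema = {
    val sessionTz = SQLConf.get.sessionLocalTimeZone // value of spark.sql.session.timeZone
    Utils.toArrowSchema(sparkSchema, sessionTz)
  }
}
```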
However, we are now seeing that the hardcoded UTC assumption causes correctness issues or exceptions when the data source is not Parquet:
- https://github.com/apache/datafusion-comet/issues/2720
- https://github.com/apache/datafusion-comet/issues/2649
This epic is for reviewing and discussing Comet's approach to time zones.
Describe the potential solution
No response
Additional context
No response
> when the data source is not Parquet:

when is this true?
This is true for Comet sinks such as the following (see the sketch after this list):
- LocalTableScanExec
- CometSparkToColumnarExec
- Exchanges
- UnionExec
- CoalesceExec
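These operators can feed Comet data that did not come through Comet's Parquet scan, so there is no scan-time step that normalizes values to UTC; the Arrow schema produced by the conversion has to describe the data as Spark actually holds it. As a rough illustration (Arrow's Java type names, not Comet's current code), the two Spark timestamp types would need different treatment:

```scala
import org.apache.arrow.vector.types.TimeUnit
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}

object TimestampMappingSketch {
  // Illustration only: Spark's TimestampType holds UTC-normalized microseconds
  // (instant semantics), so the zone on the Arrow field is rendering metadata;
  // TimestampNTZType has no zone at all, so stamping it with "UTC" would change
  // what the values mean.
  def toArrowTimestampType(dt: DataType, sessionTz: String): ArrowType = dt match {
    case TimestampType    => new ArrowType.Timestamp(TimeUnit.MICROSECOND, sessionTz)
    case TimestampNTZType => new ArrowType.Timestamp(TimeUnit.MICROSECOND, null)
    case other => throw new IllegalArgumentException(s"Not a timestamp type: $other")
  }
}
```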
Another related timezone issue:
Cast from string to timestamp:
```scala
case DataTypes.TimestampType if timeZoneId.exists(tz => tz != "UTC") =>
  Incompatible(Some(s"Cast will use UTC instead of $timeZoneId"))
```
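The fallback exists because Spark interprets a timestamp string without an explicit offset in the session timezone, while a UTC-only native cast anchors it to UTC. A small illustration of the resulting discrepancy, using plain java.time and an arbitrary example zone (not Comet code):

```scala
import java.time.{LocalDateTime, ZoneId}

object CastTimezoneExample extends App {
  // A zone-less timestamp string is just a wall-clock value; the instant it maps
  // to depends on which zone is used to interpret it.
  val wallClock = LocalDateTime.parse("2020-01-01T00:00:00")

  // With spark.sql.session.timeZone=America/Los_Angeles, Spark's cast produces:
  val sparkInstant = wallClock.atZone(ZoneId.of("America/Los_Angeles")).toInstant
  println(sparkInstant) // 2020-01-01T08:00:00Z

  // A cast that always assumes UTC produces a different instant:
  val utcInstant = wallClock.atZone(ZoneId.of("UTC")).toInstant
  println(utcInstant)   // 2020-01-01T00:00:00Z, eight hours away from Spark's answer
}
```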