[WIP][SPARK-54372][SQL] ANSI mode should reject avg/sum on timestamp types
### What changes were proposed in this pull request?
In ANSI mode, `avg(timestamp)` and `sum(timestamp)` now fail with a proper type error instead of silently casting the timestamp to double.
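For illustration, a minimal spark-shell sketch of the intended new behavior (the table and column names here are made up for the demo):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.ansi.enabled", "true")
  .getOrCreate()

// With this change, analysis fails with DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE
// instead of silently casting the timestamp to epoch seconds as a double.
spark.sql("SELECT avg(ts) FROM VALUES (timestamp'1969-12-21 00:00:00') AS t(ts)").collect()
```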
### Why are the changes needed?
Per the SQL standard, AVG is defined only for numeric types. The current behavior (inherited for Hive compatibility) silently casts the timestamp to epoch seconds as a double, which:

- Returns confusing results (e.g., `-979200.0` for `1969-12-21`), as sketched below
- Violates ANSI SQL semantics
- Differs from other databases such as PostgreSQL, which reject this
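A minimal repro of the legacy behavior (ANSI mode off); the exact double returned depends on the session time zone:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()

// The timestamp is implicitly cast to epoch seconds (a double), so the result
// is a bare negative number for pre-1970 inputs, e.g. -979200.0 depending on
// the session time zone.
spark.sql("SELECT avg(ts) FROM VALUES (timestamp'1969-12-21 00:00:00') AS t(ts)").show()
```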
### Does this PR introduce any user-facing change?
Yes. In ANSI mode (`spark.sql.ansi.enabled=true`), queries like `SELECT avg(timestamp_col) FROM table` will now fail with:

`DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE: The 1st parameter requires the NUMERIC or ANSI INTERVAL type, however timestamp_col has the type TIMESTAMP.`
Non-ANSI mode behavior is unchanged (backward compatible).
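For workloads that relied on the old numeric result, one possible migration path is to make the conversion explicit; a sketch, assuming epoch seconds are the unit actually wanted (`unix_seconds` returns a BIGINT, so pick the function that matches the desired unit):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.ansi.enabled", "true")
  .getOrCreate()

// Converting explicitly keeps the aggregate legal under ANSI mode: avg now
// runs over BIGINT epoch seconds rather than an implicitly-cast timestamp.
spark.sql("SELECT avg(unix_seconds(ts)) FROM VALUES (timestamp'1969-12-21 00:00:00') AS t(ts)").show()
```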
### How was this patch tested?
- Added unit tests in `AnsiTypeCoercionSuite`
- Added unit tests in `TypeCoercionSuite` (an illustrative sketch of this kind of check follows the list)
- All existing tests pass
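For reference, a rough sketch of what an end-to-end check could look like. The suite name is hypothetical, the real coverage in this PR lives in the analyzer-level suites above, and the sketch assumes Spark's internal test helpers `QueryTest` and `SharedSparkSession`:

```scala
import org.apache.spark.sql.{AnalysisException, QueryTest}
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite, for illustration only.
class AnsiTimestampAggSuite extends QueryTest with SharedSparkSession {
  test("ANSI mode rejects avg over timestamp input") {
    withSQLConf("spark.sql.ansi.enabled" -> "true") {
      val e = intercept[AnalysisException] {
        sql("SELECT avg(ts) FROM VALUES (timestamp'1969-12-21 00:00:00') AS t(ts)").collect()
      }
      // The error class should surface in the analysis error message.
      assert(e.getMessage.contains("DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE"))
    }
  }
}
```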
### Was this patch authored or co-authored using generative AI tooling?
No, except for the PR description.