spark icon indicating copy to clipboard operation
spark copied to clipboard

[WIP][SPARK-54372][SQL] ANSI mode should reject avg/sum on timestamp types

Open ashrithb opened this issue 2 months ago • 0 comments

What changes were proposed in this pull request?

In ANSI mode, avg(timestamp) and sum(timestamp) now fail with a proper type error instead of silently casting to double.

Why are the changes needed?

Per SQL standards, AVG is only defined for numeric types. The current behavior (inherited from Hive compatibility) silently casts timestamp to epoch seconds as a double, which:

  1. Returns confusing results (e.g., -979200.0 for 1969-12-21)
  2. Violates ANSI SQL semantics
  3. Differs from other databases like PostgreSQL which reject this

Does this PR introduce any user-facing change?

Yes. In ANSI mode (spark.sql.ansi.enabled=true), queries like SELECT avg(timestamp_col) FROM table will now fail with: DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE: The 1st parameter requires the NUMERIC or ANSI INTERVAL type, however timestamp_col has the type TIMESTAMP.

Non-ANSI mode behavior is unchanged (backward compatible).

How was this patch tested?

  • Added unit tests in AnsiTypeCoercionSuite
  • Added unit tests in TypeCoercionSuite
  • All existing tests pass

Was this patch authored or co-authored using generative AI tooling?

No, except for PR description.

ashrithb avatar Dec 07 '25 06:12 ashrithb