
bug: CAST timestamp to string ignores timezone prior to Spark 3.4

Open · andygrove opened this issue on May 24, 2024 · 2 comments

Describe the bug

In CometExpressionSuite we have two tests that are ignored for Spark 3.2 and 3.3.

  test("cast timestamp and timestamp_ntz to string") {
    // TODO: make the test pass for Spark 3.2 & 3.3
    assume(isSpark34Plus)
  test("cast timestamp and timestamp_ntz to long, date") {
    // TODO: make the test pass for Spark 3.2 & 3.3
    assume(isSpark34Plus)

Enabling these tests on Spark 3.2 shows incorrect output:

== Results ==
  !== Correct Answer - 2001 ==                                                                         == Spark Answer - 2001 ==
   struct<tz_millis:string,ntz_millis:string,tz_micros:string,ntz_micros:string>                       struct<tz_millis:string,ntz_millis:string,tz_micros:string,ntz_micros:string>
  ![1970-01-01 05:29:59.991,1970-01-01 05:29:59.991,1970-01-01 05:29:59.991,1970-01-01 05:29:59.991]   [1970-01-01 05:29:59.991,1969-12-31 23:59:59.991,1970-01-01 05:29:59.991,1969-12-31 23:59:59.991]
  == Results ==
  !== Correct Answer - 10000 ==                                                                                                              == Spark Answer - 10000 ==
   struct<tz_millis:bigint,tz_micros:bigint,tz_millis_to_date:date,ntz_millis_to_date:date,tz_micros_to_date:date,ntz_micros_to_date:date>   struct<tz_millis:bigint,tz_micros:bigint,tz_millis_to_date:date,ntz_millis_to_date:date,tz_micros_to_date:date,ntz_micros_to_date:date>
  ![-1,-1,1970-01-01,1970-01-01,1970-01-01,1970-01-01]                                                                                       [-1,-1,1970-01-01,1969-12-31,1970-01-01,1969-12-31]

We should fall back to Spark rather than produce the wrong results.
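As a rough illustration, here is a minimal sketch of what that fallback gate could look like. The names (`CastSupport`, `castSupport`, `Native`, `FallbackToSpark`) and the version check are illustrative only, not Comet's actual API; the idea is simply to refuse native execution of these casts before Spark 3.4:

```scala
import org.apache.spark.sql.types._

object CastSupport {

  sealed trait Support
  case object Native extends Support          // safe to run in Comet's native engine
  case object FallbackToSpark extends Support // known incompatibility, let Spark evaluate

  // Derived from the running Spark version, e.g. "3.2.2" -> (3, 2).
  private val Array(sparkMajor, sparkMinor) =
    org.apache.spark.SPARK_VERSION.split("\\.").take(2).map(_.toInt)

  val isSpark34Plus: Boolean =
    sparkMajor > 3 || (sparkMajor == 3 && sparkMinor >= 4)

  def castSupport(from: DataType, to: DataType): Support = (from, to) match {
    // CAST from timestamp/timestamp_ntz to string, date, or long depends on
    // timezone handling that changed in Spark 3.4, so only run it natively
    // on 3.4+ and let Spark evaluate it on older versions.
    case (TimestampType | TimestampNTZType, StringType | DateType | LongType) if !isSpark34Plus =>
      FallbackToSpark
    case _ =>
      Native
  }
}
```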

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

andygrove commented on May 24, 2024

IIRC there were differences in output between Spark 3.2 and Spark 3.4 for the timestamp_ntz type. Taking a closer look, the definition of timestamp_ntz (in Spark) essentially means that the value should be left untouched. So a value of 0 means 1970-01-01 00:00:00 in the session timezone. In the example above, the value is -1, so the correct output for timestamp_ntz (millis) should be 1969-12-31 23:59:59 (ignoring the millis). Spark 3.2's answer of 1970-01-01 05:29:59 seems incorrect to me.
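For reference, a small java.time sketch of where the two renderings in the diff come from. The Asia/Kolkata session timezone (UTC+05:30) and the -9 ms epoch value are inferred from the ".991" and "05:29:59" values in the output above, not taken from the test source:

```scala
import java.time.{Instant, ZoneId, ZoneOffset}
import java.time.format.DateTimeFormatter

object TimestampRendering extends App {
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
  val instant = Instant.ofEpochMilli(-9L) // 9 ms before the epoch

  // timestamp (with local time zone): the instant is rendered in the session timezone.
  println(fmt.format(instant.atZone(ZoneId.of("Asia/Kolkata")))) // 1970-01-01 05:29:59.991

  // timestamp_ntz: the stored wall-clock value is rendered with no zone shift
  // (equivalent to formatting in UTC).
  println(fmt.format(instant.atZone(ZoneOffset.UTC)))            // 1969-12-31 23:59:59.991
}
```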

parthchandra commented on May 29, 2024

I've recently been learning about the project. Could I be assigned this issue if it hasn't already been resolved? Thanks.

suibianwanwank commented on Jun 27, 2024