
Different semantics of casting from int64 to timestamp between Comet and Spark

Open viirya opened this issue 1 year ago • 7 comments

What is the problem the feature request solves?

In Spark, casting from LongType to TimestampType interprets the input value as seconds and converts it to microseconds. arrow-rs's cast kernel, by contrast, simply reinterprets an Int64Array as a TimestampMicrosecondArray. So the scale is different, i.e., seconds vs. microseconds.

We need to add this special case into Comet's cast kernel.
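
For illustration, here is a minimal, self-contained sketch of the Spark-side semantics (the object name is illustrative, not from the Comet codebase):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  object LongToTimestampDemo {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
      import spark.implicits._
      // Spark interprets the long as *seconds* since the epoch:
      // 1L -> 1970-01-01 00:00:01
      Seq(1L).toDF("a").select(col("a").cast("timestamp")).show(false)
      // A raw reinterpretation as microseconds (what arrow-rs's cast kernel
      // effectively does) would instead yield 1970-01-01 00:00:00.000001.
      spark.stop()
    }
  }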

Describe the potential solution

No response

Additional context

No response

viirya avatar Feb 29 '24 20:02 viirya

This cast is currently disabled by default, but it would be good to implement this. I have added this to the 0.2.0 milestone.
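
As a side note, casts that Comet considers incompatible can usually be opted into while this is in progress; a sketch, assuming the config key is spark.comet.cast.allowIncompatible (check CometConf for the authoritative name):

  // Assumed config key; verify against CometConf before relying on it.
  spark.conf.set("spark.comet.cast.allowIncompatible", "true")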

andygrove avatar Jul 25 '24 13:07 andygrove

I can take this

justahuman1 avatar Sep 26 '24 01:09 justahuman1

Thanks @justahuman1!

A good place to start is org.apache.comet.expressions.CometCast#canCastFromLong. This currently returns false for casting to timestamp, so enabling it here is the first step.
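
A rough sketch of what that change could look like (hedged: the actual method shape and SupportLevel values in CometCast may differ):

  // Sketch only -- other match arms elided; actual signature may differ.
  private def canCastFromLong(toType: DataType): SupportLevel =
    toType match {
      // previously this target type was effectively unsupported; mark it
      // compatible once the native kernel scales seconds -> microseconds
      case DataTypes.TimestampType => Compatible()
      case _ => Unsupported
    }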

Next, you can enable the following test in CometCastSuite:

  ignore("cast LongType to TimestampType") {
    // java.lang.ArithmeticException: long overflow
    castTest(generateLongs(), DataTypes.TimestampType)
  }

andygrove avatar Sep 29 '24 14:09 andygrove

Also check CometExpressionSuite.test("cast timestamp and timestamp_ntz"). This test reads timestamps as longs from a Parquet file (which may store the values as either millis or micros).
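
A quick way to see why the storage scale matters (spark-shell style fragment with hypothetical data, not the actual suite fixture):

  import spark.implicits._
  val millis = System.currentTimeMillis()   // ~1.7e12 for recent dates
  val micros = millis * 1000L               // ~1.7e15
  Seq(millis, micros).toDF("a")
    .write.mode("overwrite").parquet("/tmp/ts_longs")
  // Under Spark's seconds-based cast, both values land far in the future,
  // so a kernel reading these longs must know the intended scale.
  spark.read.parquet("/tmp/ts_longs")
    .selectExpr("a", "cast(a as timestamp)")
    .show(false)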

parthchandra avatar Sep 30 '24 21:09 parthchandra

A better test would be something like this:

  test("cast LongType to TimestampType") {
    // input value is interpreted as milliseconds but will be cast to microseconds
    val min = -9223372036854775L
    val max =  9223372036854775L
    val values = gen.generateLongs(dataSize)
      .filter(x => x >= min && x <= max) ++ Seq(min, max)
    castTest(withNulls(values).toDF("a"), DataTypes.TimestampType)
  }

However, the min value here actually causes an overflow in Spark, and I am not sure why, so this needs some more research.

andygrove avatar Apr 03 '25 14:04 andygrove

> the min value here is actually causing an overflow in Spark

Probably because Spark is converting the value from millis to micros?

parthchandra avatar Apr 03 '25 17:04 parthchandra

> the min value here is actually causing an overflow in Spark

> Probably because Spark is converting the value from millis to micros?

Right, my understanding was that the minimum valid value for micros would be -9223372036854775808 (Long.MinValue), so dividing that by 1000 to convert to millis would be -9223372036854775. The actual min value supported by the cast appears to be at least one order of magnitude smaller than that, though.
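
For reference, the arithmetic behind those bounds (plain Scala; the seconds interpretation comes from the issue description above):

  val longMin = Long.MinValue            // -9223372036854775808 (min micros value)
  val minAsMillis  = longMin / 1000L     // -9223372036854775
  val minAsSeconds = longMin / 1000000L  // -9223372036854
  // If the cast input is interpreted as seconds rather than millis, the safe
  // lower bound is minAsSeconds -- three orders of magnitude smaller in
  // absolute value than the test's min, which would explain the overflow.
  println((minAsMillis, minAsSeconds))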

andygrove avatar Apr 03 '25 17:04 andygrove