Cast String to Date ANSI Mode - Spark 3.2 - Mismatch between Spark and Comet Errors

Open vidyasankarv opened this issue 1 year ago • 2 comments

Describe the bug

When a String that is not a valid date is cast to DateType in ANSI mode, the error raised by Spark 3.2 differs from the one raised by Comet.

In Spark 3.2, the error message is:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.10 executor driver): java.time.DateTimeException: Cannot cast 0 to DateType.

In Spark 3.3 and above, the error message is:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.10 executor driver): org.apache.spark.SparkDateTimeException: [CAST_INVALID_INPUT] The value '0' of the type "STRING" cannot be cast to "DATE" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.

Currently, Comet's error message matches that of Spark 3.3 and above.
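One way a test could account for this difference is to pick the expected message by Spark version. The sketch below is only an illustration: `ExpectedCastError`, `isSpark33Plus`, and `expectedFragment` are hypothetical names, not part of Spark or Comet, and the message fragments are copied from the two errors quoted above.

```scala
// Hypothetical helper (not part of Spark or Comet): choose the expected
// error text for an invalid String-to-Date cast based on the Spark version.
object ExpectedCastError {

  // Parse "major.minor[.patch]" into (major, minor), e.g. "3.2.0" -> (3, 2).
  def majorMinor(version: String): (Int, Int) = {
    val parts = version.split("\\.")
    (parts(0).toInt, parts(1).toInt)
  }

  // True for Spark 3.3 and above.
  def isSpark33Plus(version: String): Boolean = {
    val (major, minor) = majorMinor(version)
    major > 3 || (major == 3 && minor >= 3)
  }

  // Message fragment, as quoted in this issue, for the given version.
  def expectedFragment(version: String, value: String): String =
    if (isSpark33Plus(version)) {
      s"""[CAST_INVALID_INPUT] The value '$value' of the type "STRING" cannot be cast to "DATE" because it is malformed."""
    } else {
      s"Cannot cast $value to DateType."
    }
}
```

A test could then check that the thrown error's message contains `ExpectedCastError.expectedFragment(version, "0")` for the running Spark version (for example, the value of `spark.version`) instead of being skipped on Spark 3.2.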

Steps to reproduce

The CometTestSuite "cast StringType to DateType" test includes an assumption that restricts it to Spark 3.3 and above. Removing that assumption causes the test to fail when the suite is run with JDK 1.8 and Spark 3.2.0.

You can also reproduce this error locally in a spark-shell set up with JDK 1.8 and Spark 3.2.0:

$SPARK_HOME/bin/spark-shell --conf spark.sql.ansi.enabled=true

import org.apache.spark.sql._
import org.apache.spark.sql.types._

import java.io.File
import java.nio.file.Files

// Write the DataFrame to a temporary Parquet file and read it back,
// so the cast runs against data scanned from Parquet.
def roundtripParquet(df: DataFrame): DataFrame = {
  val tempDir = Files.createTempDirectory("spark").toString
  val filename = new File(tempDir, s"castTest_${System.currentTimeMillis()}.parquet").toString
  df.write.mode(SaveMode.Overwrite).parquet(filename)
  spark.read.parquet(filename)
}

import spark.implicits._

// "0" is not a valid date, so the cast below fails in ANSI mode.
val data = roundtripParquet(Seq("0").toDF("a"))
data.createOrReplaceTempView("t")
val df = spark.sql(s"select a, cast(a as ${DataTypes.DateType.sql}) from t order by a")
df.collect().foreach(println)
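Both errors above arrive wrapped in a `SparkException`, so comparing them is easier once the innermost exception is extracted. The following is a minimal sketch, assuming the cast error is attached as a cause of the thrown exception (this is not confirmed in the issue); `RootCause` is a hypothetical helper, not a Spark or Comet API.

```scala
import scala.annotation.tailrec

// Hypothetical helper: follow getCause links until the innermost exception.
object RootCause {
  @tailrec
  def rootCause(t: Throwable): Throwable = {
    val cause = t.getCause
    // Stop when there is no cause, or when an exception is its own cause.
    if (cause == null || (cause eq t)) t else rootCause(cause)
  }
}
```

In the shell session above, this could be used as `try df.collect() catch { case e: Exception => println(RootCause.rootCause(e)) }` to print only the underlying cast error.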

Expected behavior

The CometTestSuite cast String to DateType test should pass on Spark 3.2.0.

Additional context

https://github.com/apache/datafusion-comet/pull/383#issuecomment-2115341055

vidyasankarv, May 17 '24 04:05

Is this an issue of just a mismatch between error messages? Or is the cast actually not doing the right thing with Spark 3.2?

parthchandra, May 20 '24 23:05

Is this an issue of just a mismatch between error messages? Or is the cast actually not doing the right thing with Spark 3.2?

It is only a mismatch between error messages. @andygrove, we skipped fixing that for now since it is not a high priority and created a ticket instead: https://github.com/apache/datafusion-comet/pull/383#issuecomment-2115341055

vidyasankarv, May 21 '24 11:05

We can close this now that we no longer support Spark 3.2. Thanks @vidyasankarv

andygrove, Jun 26 '24 16:06