datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

Implement Spark-compatible CAST from String to Decimal

Open andygrove opened this issue 1 year ago • 6 comments

What is the problem the feature request solves?

What is the problem the feature request solves?

We currently delegate to DataFusion when casting from string to decimal and there are some differences in behavior compared to Spark.

  • An input of 4e7 produces 40000000.00 in Spark, and null in DataFusion
  • Inputs of ., -, + and empty string produce null in Spark, and 0.0 in DataFusion
  • Input of 0 produces 0 in Spark, and null in DataFusion
  • Arrow-rs does not support negative scale (Cannot cast string to decimal with negative scale). We could choose to fallback to Spark for this use case (or if SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED is enabled)

Describe the potential solution

No response

Additional context

I used the following test in CometCastSuite to explore this.

  test("cast string to decimal") {
    val values = generateStrings(numericPattern, 5).toDF("a")
    castTest(values, DataTypes.createDecimalType(10, 2))
    castTest(values, DataTypes.createDecimalType(10, 0))
    withSQLConf((SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key, "true")) {
      castTest(values, DataTypes.createDecimalType(10, -2))
    }
  }

Describe the potential solution

No response

Additional context

No response

andygrove avatar Apr 25 '24 13:04 andygrove

Hi, I'd like to contribute to this!

kevinmingtarja avatar Apr 26 '24 05:04 kevinmingtarja

Thanks @kevinmingtarja.

You can take a look at @andygrove's PR as a reference https://github.com/apache/datafusion-comet/pull/307

viirya avatar Apr 26 '24 06:04 viirya

Current state for reference:

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val inputs = Seq("", "0", "1", "+1.0", ".34", "-10.0", "4e7").toDF("n")
inputs: org.apache.spark.sql.DataFrame = [n: string]

scala> inputs.write.mode("overwrite").parquet("test.parquet")
24/04/28 11:46:51 INFO src/lib.rs: Comet native library initialized
                                                                                
scala> val df = spark.read.parquet("test.parquet")
df: org.apache.spark.sql.DataFrame = [n: string]

scala> val df2 = df.withColumn("converted", col("n").cast(DataTypes.createDecimalType(10, 2)))
df2: org.apache.spark.sql.DataFrame = [n: string, converted: decimal(10,2)]

scala> df2.show
+-----+---------+
|    n|converted|
+-----+---------+
|-10.0|   -10.00|
| +1.0|     1.00|
|  .34|     0.34|
|  4e7|     null|
|    1|     1.00|
|    0|     0.00|
|     |     0.00|
+-----+---------+


scala> spark.conf.set("spark.comet.enabled", false)

scala> df2.show
+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+

Note: I encountered a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional on the second df2.show, but it seems like a known issue and can safely be ignored: https://kb.databricks.com/scala/decimal-is-fractional-error

kevinmingtarja avatar Apr 28 '24 05:04 kevinmingtarja

Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.

sujithjay avatar Jun 04 '24 00:06 sujithjay

Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.

Hey, i don't think i have the bandwidth rn to complete this, so please feel free to work on it. I have made some progress here on a branch in my fork, so feel free to take inspirations from there as well if needed!

kevinmingtarja avatar Jun 05 '24 17:06 kevinmingtarja

Hi @sujithjay, let me know if you are still working on this. I am happy to continue #615 and add you as a co-author. Thank you

justahuman1 avatar Sep 26 '24 01:09 justahuman1

I was looking at this issue and now - with comet enabled and comet disabled shows same result -


+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+


scala> spark.conf.set("spark.comet.enabled", false)

scala> df2.show
+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+

But even with comet enabled it does not go through DataFusion as it is marked as incompatible, I then checked this in the DataFusion project the following queries

query TR
select arrow_typeof(cast(4e7 as decimal(10,2))), cast(4e7 as decimal(10,2));
----
Decimal128(10, 2) 40000000

works fine but the following one if I add a '4e7' as shown below it throws arrow-error

External error: query failed: DataFusion error: Arrow error: Cast error: Cannot cast string '4e7' to value of Decimal128(38, 10) type

[SQL] select arrow_typeof(cast('4e7' as decimal(10,2))), cast('4e7' as decimal(10,2));

I think, if we add the functionality for string of this format to decimal in arrow-rs, then the above error could be resolved and also this issue will be addressed. Previous attempt to fix this issue was made on DataFusion project. Can someone please help me understand whether the fix can be applied in arrow-rs and use the cast from there?

himadripal avatar Dec 09 '24 04:12 himadripal

I think, if we add the functionality for string of this format to decimal in arrow-rs, then the above error could be resolved and also this issue will be addressed. Previous attempt to fix this issue was made on DataFusion project. Can someone please help me understand whether the fix can be applied in arrow-rs and use the cast from there?

It makes sense to me for arrow-rs to support casting 4e7 as decimal. There was actually a discussion about this in https://github.com/apache/datafusion/issues/10315

andygrove avatar Dec 11 '24 20:12 andygrove

take, this arrow-pr should fix one.

himadripal avatar Dec 27 '24 00:12 himadripal