Implement Spark-compatible CAST from String to Decimal
What is the problem the feature request solves?
We currently delegate to DataFusion when casting from string to decimal and there are some differences in behavior compared to Spark.
- An input of `4e7` produces `40000000.00` in Spark, and `null` in DataFusion
- Inputs of `.`, `-`, `+`, and the empty string produce `null` in Spark, and `0.0` in DataFusion
- An input of `0` produces `0` in Spark, and `null` in DataFusion
- Arrow-rs does not support negative scale (`Cannot cast string to decimal with negative scale`). We could choose to fall back to Spark for this use case (or if `SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED` is enabled)
Describe the potential solution
No response
Additional context
I used the following test in CometCastSuite to explore this.
```scala
test("cast string to decimal") {
  val values = generateStrings(numericPattern, 5).toDF("a")
  castTest(values, DataTypes.createDecimalType(10, 2))
  castTest(values, DataTypes.createDecimalType(10, 0))
  withSQLConf((SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key, "true")) {
    castTest(values, DataTypes.createDecimalType(10, -2))
  }
}
```
Hi, I'd like to contribute to this!
Thanks @kevinmingtarja.
You can take a look at @andygrove's PR as a reference https://github.com/apache/datafusion-comet/pull/307
Current state for reference:
```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val inputs = Seq("", "0", "1", "+1.0", ".34", "-10.0", "4e7").toDF("n")
inputs: org.apache.spark.sql.DataFrame = [n: string]

scala> inputs.write.mode("overwrite").parquet("test.parquet")
24/04/28 11:46:51 INFO src/lib.rs: Comet native library initialized

scala> val df = spark.read.parquet("test.parquet")
df: org.apache.spark.sql.DataFrame = [n: string]

scala> val df2 = df.withColumn("converted", col("n").cast(DataTypes.createDecimalType(10, 2)))
df2: org.apache.spark.sql.DataFrame = [n: string, converted: decimal(10,2)]

scala> df2.show
+-----+---------+
|    n|converted|
+-----+---------+
|-10.0|   -10.00|
| +1.0|     1.00|
|  .34|     0.34|
|  4e7|     null|
|    1|     1.00|
|    0|     0.00|
|     |     0.00|
+-----+---------+

scala> spark.conf.set("spark.comet.enabled", false)

scala> df2.show
+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+
```
Note: I encountered a `java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional` on the second `df2.show`, but it seems to be a known issue that can safely be ignored: https://kb.databricks.com/scala/decimal-is-fractional-error
Hi @kevinmingtarja, are you working on this issue? If not, I would like to work on it. Thank you.
Hey, I don't think I have the bandwidth right now to complete this, so please feel free to work on it. I have made some progress on a branch in my fork, so feel free to take inspiration from there as well if needed!
Hi @sujithjay, let me know if you are still working on this. I am happy to continue #615 and add you as a co-author. Thank you
I was looking at this issue, and now it shows the same result with Comet enabled and with Comet disabled:

```scala
scala> df2.show
+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+

scala> spark.conf.set("spark.comet.enabled", false)

scala> df2.show
+-----+-----------+
|    n|  converted|
+-----+-----------+
|-10.0|     -10.00|
| +1.0|       1.00|
|  .34|       0.34|
|  4e7|40000000.00|
|    1|       1.00|
|    0|       0.00|
|     |       null|
+-----+-----------+
```
But even with Comet enabled it does not go through DataFusion, as this cast is marked as incompatible. I then checked the following queries in the DataFusion project. This one:
```
query TR
select arrow_typeof(cast(4e7 as decimal(10,2))), cast(4e7 as decimal(10,2));
----
Decimal128(10, 2) 40000000
```
works fine, but if I cast the string `'4e7'` instead, as shown below, it throws an Arrow error:

```
External error: query failed: DataFusion error: Arrow error: Cast error: Cannot cast string '4e7' to value of Decimal128(38, 10) type
[SQL] select arrow_typeof(cast('4e7' as decimal(10,2))), cast('4e7' as decimal(10,2));
```
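To illustrate the gap: arrow-rs's string-to-decimal parser rejects exponent notation, whereas on the JVM `java.math.BigDecimal` can normalize it to plain form. A small Scala sketch of that transformation (the `normalizeExponent` helper is hypothetical, shown only to demonstrate the idea):

```scala
import java.math.{BigDecimal => JBigDecimal}

object DecimalNotation {
  // Hypothetical helper: normalize scientific notation ("4e7") to a plain
  // decimal string ("40000000") that an exponent-rejecting parser could accept.
  def normalizeExponent(s: String): Option[String] =
    try Some(new JBigDecimal(s.trim).toPlainString)
    catch { case _: NumberFormatException => None }
}
```

Here `normalizeExponent("4e7")` returns `Some("40000000")`, while an unparsable input returns `None`.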
I think that if we add support for casting strings in this format to decimal in arrow-rs, the above error could be resolved and this issue would be addressed as well. A previous attempt to fix this was made in the DataFusion project. Can someone please help me understand whether the fix can be applied in arrow-rs so that we can use the cast from there?
It makes sense to me for arrow-rs to support casting `4e7` as decimal. There was actually a discussion about this in https://github.com/apache/datafusion/issues/10315
take. This arrow-rs PR should fix this one.