datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

[EPIC] Spark-compatible cast / try_cast operations

Open andygrove opened this issue 1 year ago • 8 comments

What is the problem the feature request solves?

Comet currently delegates to DataFusion for many cast operations, and the behavior is not guaranteed to match Spark. This epic is to track fully implementing Spark-compatible cast and try_cast operations in Comet, with support for ANSI mode.

For each item in this list to be considered complete, we should have scala tests demonstrating that cast and try_cast produce the same results as Spark, both with ANSI mode enabled and disabled, using fuzz testing to find edge cases. We can update this list with links to issues as we make progress.

For cast operations that we cannot easily support with full compatibility, we should either fall back to Spark or provide a configuration that the user can enable to allow the operation to run in Comet. We should also provide documentation explaining any differences in behavior compared to Spark.

  • [ ] Cast from string to another type
    • [x] Boolean - https://github.com/apache/datafusion-comet/pull/290
    • [x] Integral Types (byte, short, int, long) - https://github.com/apache/datafusion-comet/pull/307
    • [ ] Floating-point (float, double) - https://github.com/apache/datafusion-comet/issues/326
    • [ ] Decimal - https://github.com/apache/datafusion-comet/issues/325
    • [x] Date - https://github.com/apache/datafusion-comet/issues/327
    • [ ] Timestamp - https://github.com/apache/datafusion-comet/issues/328 and https://github.com/apache/datafusion-comet/issues/376
  • [ ] Cast to string from primitive types
    • [x] Boolean - https://github.com/apache/datafusion-comet/pull/351
    • [x] Integral Types (byte, short, int, long) - https://github.com/apache/datafusion-comet/pull/351
    • [x] Floating-point (float, double) - https://github.com/apache/datafusion-comet/issues/312
    • [ ] Decimal - seems to correct but needs tests to confirm, also should fall back to Spark if spark.sql.legacy.allowNegativeScaleOfDecimal is true and scale is negative
    • [ ] Date - seems correct but needs tests to confirm
    • [ ] Timestamp - seems correct but needs tests to confirm
  • [ ] Cast between numeric types
    • [x] Integral to Integral - https://github.com/apache/datafusion-comet/issues/311
    • [x] Integral to Boolean - https://github.com/apache/datafusion-comet/pull/351
    • [ ] Integral to Decimal - https://github.com/apache/datafusion-comet/issues/2049
    • [x] Integral to Floating-point - https://github.com/apache/datafusion-comet/pull/351
    • [x] Floating-point to Boolean - https://github.com/apache/datafusion-comet/pull/351
    • [x] Floating-point to Decimal - https://github.com/apache/datafusion-comet/issues/371
    • [x] Floating-point to Integral - https://github.com/apache/datafusion-comet/issues/350
    • [ ] Decimal to Boolean - fails because arrow does not support this cast
    • [x] Decimal to Integral - same issues as https://github.com/apache/datafusion-comet/issues/350
    • [ ] Decimal to Floating-point - seems correct but needs tests to confirm
    • [x] https://github.com/apache/datafusion-comet/issues/375
  • [ ] Cast between temporal types
    • [ ] Date to boolean/int/float/decimal - results are incorrect
    • [ ] Date to Timestamp / TimestampNTZ
    • [ ] Timestamp to boolean/int/float/decimal - https://github.com/apache/datafusion-comet/issues/352
    • [ ] Timestamp to Date
  • [ ] Other
    • [x] https://github.com/apache/datafusion-comet/issues/377
    • [ ] https://github.com/apache/datafusion-comet/issues/378

In addition to the above tasks, we also need to do the following:

  • [x] Implement a mechanism where we can selectively fall back to Spark for specific cast operations (https://github.com/apache/datafusion-comet/pull/337 )
  • [ ] Write documentation that explains any differences between Comet and Spark
  • [x] https://github.com/apache/datafusion-comet/issues/374

andygrove avatar Apr 18 '24 15:04 andygrove

@andygrove do we have already a framework for fuzz testing in Scala (i.e. ScalaCheck?) Should anyone wait until you are done with the first ones so you establish a pattern?

edmondop avatar Apr 19 '24 15:04 edmondop

@andygrove do we have already a framework for fuzz testing in Scala (i.e. ScalaCheck?) Should anyone wait until you are done with the first ones so you establish a pattern?

There is a CometCastSuite and I am working on a PR right now to improve this and I am also implementing cast string to boolean (an easy one) so that there is an example for others to learn from. I should have a draft PR up on Monday.

andygrove avatar Apr 19 '24 16:04 andygrove

I am now working on cast string -> integral types. I will have a PR up later this week.

andygrove avatar Apr 22 '24 19:04 andygrove

@andygrove can I take care of this " Implement a mechanism where we can selectively fall back to Spark for specific cast operations" ? I was looking at the top of the list, but everything was quickly taken

edmondop avatar Apr 28 '24 21:04 edmondop

@edmondop feel free to pick up other items on the list that don't have issues yet (I will start filing more!)

I added an example of falling back top Spark for cast string to timestamp in https://github.com/apache/datafusion-comet/pull/337 so we do have an approach. I will update that item in this epic.

andygrove avatar Apr 29 '24 19:04 andygrove

@edmondop also just added https://github.com/apache/datafusion-comet/issues/350

andygrove avatar Apr 29 '24 20:04 andygrove